Why Text Search Programming is not like Regular Programming
I remember an interview with Matt Birk, Harvard-educated center for the Baltimore Ravens (that's an American football team, for my friends across the pond, and the center is one of the big guys in the middle). He commented that, at the Ravens, all they talked about was football, from the moment they arrived at work to the moment they went home – and what a great environment that was.
This is how I feel about Search Technologies. I've been in the text search business for 22 years, and to be in a place where I come to work and all we talk about is text search all day long – it's truly a blessing.
I think this is why I am sometimes taken aback when I run up against programmers and systems engineers from the "regular world". "Why don't they get it?" I ask myself. But of course they don't – not due to any lack of intelligence or insight, but simply because they don't yet realize that text search programming is not like regular programming.
When I hear statements like "It's essentially an ETL process", "It's just XML", "I've copied everything into this database table", or "Why not use Hibernate and Spring to solve all these problems?" – that's when I know that I've encountered someone from the regular world of programming who will probably not understand the special needs of text search until they've suffered through a full-on search project of their own.
So what’s different about text search programming?
First, you must be tolerant of data variation. In text search, you can be dealing with tens of millions of documents written by actual people. This is not data entered into forms and then validated by a relational database – these are unstructured documents. The amount of variation is beyond all standard expectation. Broken XML, illegal encodings, binary files posing as HTML, stray XML tags, Japanese text in the middle of your document, titles specified 8 different ways – the variation is endless. Imagine that your grandmother, your ex-girlfriend, and your grade-school teacher all wrote documents for your database – how similar would they be?
And so, text search programmers need to create architectures that are exceptionally robust in the face of unexpected variations in data. Common practices include quarantine procedures, data fallbacks, and endless cycles of data validation. Things to avoid are "regular programming" practices such as batch processing (one bad document should not ruin an entire batch!), document imports into structured databases (how can you load something which can vary?) or any technology which requires any sort of fixed, up-front, data structure definition.
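The one-bad-document-at-a-time discipline described above can be sketched in a few lines. This is a minimal illustration, not any particular product's API; the `parse`, `index`, and `quarantine` callables are hypothetical stand-ins supplied by the caller:

```python
def process_documents(docs, parse, index, quarantine):
    """Process documents one at a time, so that a single bad record
    cannot ruin an entire batch run."""
    stats = {"indexed": 0, "quarantined": 0}
    for doc in docs:
        try:
            record = parse(doc)   # may raise on broken XML, illegal encodings, ...
            index(record)
            stats["indexed"] += 1
        except Exception as err:
            # Set the raw document aside with its error for later inspection,
            # rather than aborting the whole run.
            quarantine({"raw": doc, "error": str(err)})
            stats["quarantined"] += 1
    return stats
```

The point of the sketch is the shape of the loop: the failure path is a first-class outcome with its own destination, not an afterthought that halts the batch.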
Second, text search programmers should embrace uncertainty. How does one design a system where so much is uncertain and where data formats, structures and user requirements are constantly evolving? Where do you even begin to write requirements?
Text search is all about iterative refinement and creating increasingly rich subsets. Solve as much of the problem as you can, then iterate and start attacking what's left. The text search programmer will design systems with the up-front expectation that multiple approaches will be required.
Text search programmers use techniques which work well in uncertain environments. For example, iterative validation, evaluating statistical samplings of documents, queries, and query tokens, automated regression testing, probabilistic relevancy scoring, agile development, and continuous document and index auditing. Each of these techniques can be used to provide guidance in an uncertain world, targeting resources to those areas where they will do the most good.
Third, tomorrow there will be a new database – either an additional database which must be added to the system, or a completely new source for the existing database. The text search programmer realizes that data sets are in no way fixed, and that last-minute, brand new databases come with the territory. Architectures must allow for the quick and easy addition and substitution of data sources.
For example, one of our customers discovered an RSS feed they had forgotten they had. Metadata could now be garnered from the feed rather than from the content crawling process, as originally implemented, greatly improving the quality of extracted metadata. Fortunately, our document processing framework was able to accommodate even a dramatic change such as this in just a few hours. A second customer switched data sources halfway through the project. A third customer ended up adding many more data sources to a search system than originally envisaged (they now have over 300 sources).
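One minimal way to keep data sources easy to add and swap is a registry of fetcher functions, so the pipeline never names its sources directly. This is an illustrative pattern only – not our actual framework – and the names are invented:

```python
class SourceRegistry:
    """Holds named data-source fetchers so sources can be added or
    replaced without touching the pipeline code that consumes them."""

    def __init__(self):
        self._fetchers = {}

    def register(self, name, fetcher):
        # Re-registering an existing name deliberately replaces the
        # source - "switching data sources halfway through" is expected.
        self._fetchers[name] = fetcher

    def fetch_all(self):
        """Yield (source_name, document) pairs from every registered source."""
        for name, fetcher in self._fetchers.items():
            for doc in fetcher():
                yield name, doc
```

The pipeline iterates over `fetch_all()` and neither knows nor cares whether a source is an RSS feed, a crawl, or tomorrow's brand-new database.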
Finally, text search programmers know how to stay flexible. Flexibility is key. Here are just four examples:
- Don't just write a program to handle a single problem X. Instead, write it to handle an array of problems X[1..N]
- Do you have to change source code to add a new document collection? Don't. Instead, move the parts that change into configuration files
- Are there central controlling files for system configuration that are being constantly updated as new collections are added? Dice those configuration files into pieces – one per collection – to improve configuration control and increase the independence of collections
- How long is your deployment and test cycle? It needs to be very short, so build that requirement into your design
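The second and third bullets above – moving change into configuration, then dicing the configuration into one file per collection – can be sketched as follows. The directory layout and JSON format are assumptions for illustration, not a prescribed convention:

```python
import json
from pathlib import Path

def load_collection_configs(config_dir):
    """Load one small config file per collection. Adding a new
    collection means dropping in a file - no source-code change,
    and no edits to a central, contended configuration file."""
    configs = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        # File stem doubles as the collection name, keeping
        # collections independent of one another.
        configs[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return configs
```

Because each collection owns its own file, configuration changes for one collection cannot break another, and version control diffs stay local to the collection that changed.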
At Search Technologies we design search architectures that work with any of the leading search products and which allow for significant changes without any new code deployment at all. Even better, new collections can be added and text processing components upgraded dynamically, without a restart and without affecting other processing streams.
Knowing how to stay flexible – and how to do it in a way that remains stable and easily maintainable despite continuous change in the data landscape – is something that text search programmers have learned from years (or decades... *sigh*) of joyful work in the realm of unstructured data processing and search.