Different Types of "Hard" in Search
I recently relived my past life as a search engine developer through a class I gave at our company kickoff called “Search Engine Internals”. The goal of this class was to give our engineers an insider’s view of how search engines are developed and all of the trade-offs which go into creating a brand-new search engine from scratch.
Search Engines are Hard
It was cool, actually. Everyone started with nothing, and by the end of the day we had written 12 new search engines which did indexing, relevancy ranking, Boolean query expressions, and phrase searches. They were excellent students!
Preparing and teaching the class brought back memories of what it was like to be a search engine developer, which I did full-time from 1989 to 2000. And it also brought back that thrill you get when entering a query into your own hand-built engine and get search results – it’s a rare and precious experience. I hope my students felt a little of that tingle.
Taking this trip through memory lane has made me nostalgic for those days, but has also made me realize how much my perspective has changed since then. I look at the problems we’re solving today and I compare them to the problems which we solved then, and I’ve come to the realization that there are many different types of ‘hard’, and further, some types of hard are harder than other types.
Programming search engines is hard, no doubt about it. They have to scale to millions (maybe billions) of documents. They have to compute complex relevancy algorithms across very large data sets. They have to handle the “long tail of search” (i.e. simultaneously handling those few terms which are in millions of documents along with the millions of terms which are only in a few documents), they have to be disk based for those situations when index sizes exceed available RAM, and they have to handle many different complex data structures and operations.
But actually, they’re hard only in a very narrow and specific way. They take well defined inputs (sets of tokens) and produce very well defined outputs (sorted lists of document identifiers). They can carefully control their inputs and outputs and they can fully define the boundaries over which they operate.
In lots of ways, solving a search engine is a lot like solving Fermat’s Last Theorem (No three positive integers a, b, and c can satisfy the equation an + bn = cn for any integer value of n greater than two). It’s a small, well-defined problem. Yes, the solution can take many years and require many, many lines of code, but 1) there is no doubt about what is being solved, and 2) it can be done by a small team of programmers with little outside interaction.
But There are Other Types of Hard
As we went through Search Engine Internals training, I realized that creating a search engine is actually an easy problem, in comparison to solving the problem of document processing and data preparation.
It’s like the difference between a rifle and a shot-gun. A search engine is like a rifle. It is a narrow problem that is difficult but narrowly focused.
Document processing, on the other hand, is more like a shotgun. It must handle a very wide number of data sources, metadata formats, document formats and structures, encodings, and security models. And new ones are being created every day! Architecting a system which does this well is extremely difficult – more difficult than creating a search engine, I believe.
After all, document processing requires handling many different databases (Oracle, MySql, SQL Server, Derby), content management systems (SharePoint, FileNet, Documentum, Alfresco, Drupal), access methods (web services, http, files, JDBC, JMS, custom APIs), file structure formats (XML, JSON, PDF, Microsoft Word, Microsoft Excel, Microsoft PowerPoint, hundreds upon hundreds of other file formats), and encodings (ASCII, UTF-8, Windows 1252, etc.).
Add to this the issues of security handling (dozens of different methods for specifying users, groups, roles, and access control lists), language handling (differences in suffix stripping, decompounding, dictionary-based tokenization), custom data sources, and thousands upon thousands of different metadata structures, and you have a very, very broad problem indeed.
But which Hard is Harder?
What I’ve come to realize is that document processing requires substantially more maturity of the overall organization than is required by search engine development. Therefore, in many ways, document processing is quite a bit more difficult than search engine development.
Let’s put forth a hypothetical situation to test this theory. Suppose we gather two groups of 10 programmers each into two different rooms.
The first group gets this problem: “Develop a search engine which takes Documents D1 through Dn, each one containing tokens Td1 through Tdn. Take a query which has tokens Tq1 through Tqn made up of a query expression with operators. Now implement an engine that returns a sorted list of documents which match the query expression.”
The second group gets this problem: “Develop a system which can talk to any type of database, content management system, file system, or other source of data, handle any type of metadata or metadata processing, handle any type of file format, handle any language on the planet, and any type of security system, now or in the future.”
And the funny thing is, I’ve actually been in both of these situations. I was one of the 10 programmers in the room when both of these problems were posed.
When I used to work on RetrievalWare, we were completely focused on the first problem. We discussed token sets, set math, query processing, scalability, and on and on. It was a very narrow problem and we were able to fully define our inputs and outputs. Sure, there were hundreds of thousands of lines of C code. But that code could be divided up into logical pieces, each programmer could work as part of a small, focused team, and the overall goal was very well defined.
Later, I went to work on some large government projects which had hundreds of different document collections, metadata structures, access methods, and document processing functions to perform. It was, essentially, the second problem.
It turned out that answering the second problem was much more difficult than the first. Not only was it simply a difficult problem, but also no one could agree on the boundaries or scope of the problem, or even if there should be boundaries at all. In fact, the second question occupied a substantial team of developers for months and months without producing any sort of workable solution. It was only after we redefined the problem to include necessary business processes, that we were able to finally solve it.
There are many other examples of very hard, very broad problems which require integrated business processes. Similar problems include language processing (such as provided by BasisTech), entity extraction, and document text extraction. In lots of ways, these problems are much more difficult than text search engines because they all need to be integrated with comprehensive business processes and continuous refinement and update cycles.
And Text Analytics is yet Another Type of Hard
The second class I gave at the conference was “An Introduction to Text Analytics”. Since this was the second class, I was already thinking along the lines of “types of hard”, and that’s when I realized that Text Analytics is yet another. This is because Text analytics requires keeping two competing trains of thought inside your head at the same time, namely the competing goals of “global” and “local” meaning.
In Text Analytics, one is simultaneously concerned with getting each individual token inside each document as accurate and precise as possible (local meaning) and at the same time determining the “overall meaning” of entire documents and entire collections (e.g. categories) of documents. And so, Text Analytics quickly becomes an issue of prioritization and method of attack.
Local meaning in Text Analytics is entirely about cleaning up and handling endless thousands of small details. For example, how should one handle capitalization? When does capitalization affect meaning? When does punctuation affect meaning? When are “-“ and “/” important to retain, and are they best to be removed? When are stop words too generic to be useful and when do they affect the meaning of the tokens to which they are attached? Should one try to disambiguate a term based on local meaning? Should we try to tell the difference between the word “general” as a military rank and when it means “general purpose”?
Ironically, many of the answers to these questions may require data derived from global analysis. For example, we can use term clusters based on latent semantic analysis to disambiguate term meanings, or word frequency statistics to do statistical part of speech tagging.
Global meaning, on the other hand, is concerned with large bodies of text. This means extracting some sense of meaning of entire paragraphs, sections, documents, or groups of documents based on an aggregation of the meanings of all of the individual tokens and/or sections involved.
And so global meaning is best accomplished with large statistical processes, such as document vector generation, vector comparison, clustering, and so on. These processes deal with large amounts of text and produce abstract numbers, such as term weights and similarity scores. Verifying these numbers is typically accomplished with some sort of engine scoring mechanism and iterative refinements of parameters and algorithms.
And what’s interesting about Text Analytics is that one can easily get “stuck” in the local or “stuck” in the global. Sometimes it feels a lot like visiting the land of the Lotus Eaters; one can get too comfortable working at one level or the other. For example, a programmer can easily spend all of their time tweaking token processing and token meaning and endlessly refining tokens without considering how much impact their changes will ultimately have on the overall accuracy of the system. Similarly, one can spend endless cycles trying various aggregation and similarity formulas to incrementally improve overall performance – getting stuck making changes which produce statistically insignificant advances.
And so, Text Analytics requires an odd combination of foci – a bipolar focus, if you will. You need to focus simultaneously on the global (overall meaning) and the local (syntax) to make progress. Determining what to change globally is often best determined by opening up and reading documents which are either especially good or especially bad – essentially using a “local analysis” across many individual documents to identify what algorithms should be used for global statistical aggregation. And similarly, deciding how to do improvements to token processing should constantly reference global analyses. Global aggregation can help determine which local problems should be addressed first and how.
Problems which Touch the Universal
All of this has got me thinking about “Problems which Touch the Universal”. Perhaps another way to state this is that there are some problems which are closed (such as Fermat’s Last Theorem) and other problems which are open (such as token processing).
Closed problems have well defined inputs and outputs and live within a closed system which can be reduced to a problem without reference to the outside world. Examples of closed problems include things like core search engines algorithms, regular expression matching, finding roots of formula, relational database SQL optimization, and so on. Most computer applications are also, in and of themselves, closed problems as well. For example, inventory, accounting, e-commerce, manufacturing, modeling ballistic trajectories, etc.
Open problems, by comparison, must “touch the universal” and must be designed and approached from the very beginning with an understanding that you can never solve the complete problem and that all solutions are, at best, an approximation applicable for a particular point in time.
Wikipedia is an excellent example of an open problem. Creating an encyclopedia which condenses the entire sum of human knowledge is most certainly an endeavor which will never be complete. And designers of Wikipedia understood that this was the case and invented methods, communities, processes, etc. which will continuously refine and extend this global knowledgebase for the foreseeable future.
Imagine one person trying to write Wikipedia? I am reminded of Wowbagger from The Hitchhiker's Guide to the Galaxy, who insists on insulting every living sentient being in the universe, in alphabetical order. Now that’s an open problem if there ever was one.
Similarly, document processing is one of those open problems which seems like it should be easy, but because it’s involved with so many different systems, metadata structures, file formats, applications, etc., and more are being created every day, one tends to feel a little like Wowbagger. Just this month Search Technologies has implemented RightNow, Livelink, and Liferay connectors. These are systems which I had never encountered until just recently.
And Text Analytics is clearly an open problem as well. Since meaning touches on every aspect of human communication, both at the local level and global level, we can never have a “perfect” Text Analysis tool. It will always, of necessity, be a system that requires continuous refinement.
And even text search relevancy is an open problem. After all, relevancy ranking – “Is the document a good match?” – touches on meaning which depends on the data structure, the usage of the terms, and the infinite variety of how people express themselves, both as writers and searchers.
At Search Technologies we are always encountering open problems, which is why I love it so much. I get quickly bored of technology. Open problems, on the other hand, force you to consider the universal in order to solve them. And the solutions, when they come, often tell you just as much about your organization, business processes, and ability to tackle abstract and conceptual issues, as they do about mere technical prowess.