This article is the second in a multi-part blog series, Search for Big Data and Big Data for Search, by Paul Nelson, Chief Architect at Search Technologies.
Realization #2: Big Data will make search better and easier
Much better. Much easier.
Truly, Search was Big Data before there was Big Data. Search has always been concerned with extremely large datasets, and statistical analysis of those sets, both for indexing (i.e. large scale batch processing) as well as at query-time (i.e. high speed real-time processing). Billion-document databases have existed in search engines for decades.
But what Search has always lacked is a good, inclusive framework. Do we have that framework now? No. But we have the vision of the framework, and it lives on Big Data.
But what would we do with such a framework? Well, all sorts of wonderful, amazing things that we’ve always wanted to do, but which were too hard, too expensive, and too unreliable to be used except in complex, hand-crafted implementations. Things like:
- Link Counting / Page Rank / Anchor Text / Popularity
These are all techniques using external references to improve search relevancy. Link counting uses inbound links into a document (i.e. links from other documents) to boost relevancy. Google's "PageRank" has the same goal, but is mathematically more sophisticated. "Anchor text" can influence relevancy by taking into account how other people reference your document. Popularity boosts documents that are known to be more popular based on how often those documents are clicked.
How many implementations of these techniques have I seen over the years? Lots. And all of them have been terrible, hand-crafted, slow, awkward algorithms, which live inside document processing pipelines using relational databases.
But now, with Big Data, these algorithms become more than just possible; they are a natural evolution. After all, Google invented MapReduce specifically to handle PageRank calculations for Google.com. It is only because of the lack of an appropriate framework that all of this incredibly valuable external data remains underused in most search implementations.
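To make the link-counting idea concrete, here is a minimal sketch of PageRank computed by power iteration. The tiny three-page link graph is hypothetical, purely for illustration; a real implementation would distribute this over a MapReduce-style framework.

```python
# Minimal PageRank sketch: iterate until link-derived scores stabilize.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Hypothetical link graph: "c" is linked to by both "a" and "b".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)  # "c" ends up with the highest rank
```

Each MapReduce pass corresponds to one loop iteration here: the map step emits each page's rank shares along its outlinks, and the reduce step sums the shares arriving at each page.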
- Something Better than TF/IDF (please!)
TF-IDF (term frequency, inverse document frequency) has been around for some 30-40 years, is still used by search engines today, and, let's be honest, it is a total hack.
Oh, it’s a good hack; don’t get me wrong. It makes intuitive sense. Documents with more mentions of the query term are better, right? Query terms that occur less frequently are more useful, right? But the devil is in the details, and there is no solid mathematical foundation for TF-IDF. It is used by search engines and document scoring methods simply because it is easy to compute. Basically, it uses data that is just lying around inside the index.
And today, there are better alternatives! Why not use the real information gain formula (rather than the fake one that TF-IDF provides) to compute term weights? It is mathematically far more rigorous. Or how about building a predictor for relevancy? I did this for my own search engine in 1992, in preparation for TREC-2. Why isn't everyone doing this by now? Not only does it return a true, normalized percentage score (from 0 to 100), it also estimates the probability that a document is relevant. Wouldn't that be insanely useful?
The reason why we don’t have better scoring algorithms is, of course, that there is no good framework for computing these large, mathematically challenging algorithms. Big Data is here to save the day.
- Engine Scoring
Tell me: How accurate is your search engine?
It is beyond shocking that almost no one knows the answer to this question. Of the 500+ customers that Search Technologies has, only a handful have done truly rigorous search engine scoring.
Many search engines produce query statistics, such as the number of queries that return zero results, the top queries, or the top query terms. But do these statistics tell us, in any way, whether users are actually finding the documents they need? Of course not.
It’s ridiculous that such an important technology is lacking solid, repeatable metrics analysis. Why? Because it’s so hard to implement. It requires expensive analysis of enormous log files and/or manual relevancy judgments. If you then want to improve your search engine (and test the new version before it goes to production), you’ll need a method for executing hundreds of thousands, if not millions, of queries to gather your data.
Once again, we’ve been missing a framework for engine scoring, and Big Data provides the perfect solution. When we can feed query logs and click logs into Big Data, we now have a statistically valid method for evaluating search engine accuracy. And once we can measure search engine accuracy, we can iteratively improve it, leading to a continual, virtuous cycle that will gradually improve the engine to meet business needs.
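As one concrete example of engine scoring from click logs, here is a minimal sketch of Mean Reciprocal Rank (MRR), a standard accuracy metric. The log format and entries are hypothetical: each entry records a query and the rank of the first result the user clicked, or None if they clicked nothing.

```python
# Hypothetical click log: (query, rank of first clicked result or None).
click_log = [
    ("annual report", 1),   # clicked the top result
    ("expense policy", 3),  # clicked the third result
    ("vpn setup", None),    # abandoned: no click at all
]

def mean_reciprocal_rank(log):
    """Average of 1/rank over all queries; abandoned queries score 0."""
    scores = [1.0 / rank if rank else 0.0 for _query, rank in log]
    return sum(scores) / len(scores)

mrr = mean_reciprocal_rank(click_log)  # (1 + 1/3 + 0) / 3
```

At Big Data scale, the per-query reciprocal ranks are the map step and the averaging is the reduce step, so the same metric runs unchanged over millions of logged queries.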
And if it’s easy, as in point-and-click easy, then soon Google-quality search will be available to everyone.
- Semantic Processing
Search Engines are all about text processing and token matching - finding words and phrases as they are specified in the text. But whatever happened to semantics? Understanding the actual meaning of the content, and the author's intention?
Latent semantic analysis has been around for more than 25 years. I wrote my first word-meaning disambiguation algorithm, using WordNet, in 1993. Yet these algorithms have never made it into mainstream search engines. Why not?
When we start to combine semantic processing with world knowledge, that’s when things get interesting. But where do we get world knowledge? How about the semantic web? DBpedia, Freebase, GeoNames, Wikilinks... These are just a few of the databases available today that begin to give us machine-readable world knowledge of sufficient depth to do interesting things, such as word-meaning disambiguation.
One of the problems is agility. Semantic processing requires a lot of work and a lot of processing. Doing it excellently is enormously expensive, and requires hundreds of iterations, each of which may take weeks.
Big Data helps us to break down those agility barriers. Experimentation and testing can now be done in hours rather than days. If we can create re-usable algorithms and libraries, for text mining and semantic analysis, and put them into a framework for easy re-use, then the stars start to align, and perhaps true semantic search will become a first-class operation.
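For a flavor of what a reusable semantic-analysis library might contain, here is a minimal sketch of Lesk-style word-sense disambiguation: pick the sense whose dictionary gloss overlaps most with the surrounding context. The tiny sense inventory is hand-made for illustration; a real system would draw glosses from WordNet or a semantic-web knowledge base.

```python
# Hypothetical sense inventory: word -> {sense label: gloss}.
senses = {
    "bank": {
        "financial": "institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    }
}

def disambiguate(word, context):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = disambiguate("bank", "we walked along the river to the water")
# "water" appears in the river gloss, so the river sense wins
```

Each disambiguation is cheap, but running hundreds of iterations of it over an entire corpus is exactly the kind of embarrassingly parallel batch job a Big Data framework makes routine.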
- And lots more
There are plenty more algorithms we’d love to create, but have been hampered by the implementation cost. Things like index-time spell correction (especially important for small documents such as business listings, or classified ads), common paragraph removal (especially important when crawling web sites), presentation-influenced word weighting (not all words are equally important in a document), document summarization, sentiment analysis, and truly accurate topic clustering (using algorithms such as Latent Dirichlet Allocation).
I know what you’re saying: This is all great, but where is my PageRank algorithm? Where is my Engine Score?
Patience. Big Data is still undergoing a lot of churn right now. There are many commercial distributions of Hadoop, including Cloudera, Hortonworks, and MapR. MapReduce is giving way to YARN, Apache Spark has graduated from the Apache Incubator, and so on. It has been difficult to settle on the right technology stack.
And Big Data today is still under the control of the “Real Programmers,” and is not yet point-and-click. But that’s just a matter of time. At Search Technologies, our goal is to put together Content Acquisition, Big Data, and Search into a holistic point-and-click framework. If we can do this, then we lower the cost of achieving high quality, and suddenly kick-ass, magical search solutions become both real and possible… for the rest of us.
Realization #3: Search beats SQL