The Magic and Wonder of Query Parsing
How Advanced Query Parsing Techniques Help Improve Search Engine Performance
A query parser, simply put, translates your search string into specific instructions for the search engine. It stands between you and the documents you are seeking, and so its role in text retrieval is vital, wonderful, and often acutely frustrating. Yet a search application cannot reach its peak performance without intelligent query parsing, which allows for relevancy customization, additional security-trimming, and taking input from user interface variables or outside data sources.
Watch my video below for an in-depth illustration of advanced query parsing techniques, or read on for five practical search engine improvements you can make with an advanced query parser.
Some query parsers are very Type-A. That is, they are fanatical in their attention to detail and overly anal retentive. You want all documents with both the words A and B no more than five words apart from each other? Fine. Here they are.
Other query parsers are much more laid back and flexible. You want all documents with both the words A and B? Well, here are some good-looking documents with both, and then some additional ones I thought you might like with only the word A, or B... relax, dude.
The best query parsers understand their content and act like expert searchers on the behalf of the user. You want all of the documents with A and B? Here they are, and I've decided to sort them such that documents with A and B in the title are first. You're welcome.
Of course, like pets and their owners, it's best to choose a query parser whose personality matches that of its anticipated users. The queries entered by librarians, lawyers, or PhDs will be very different from those entered by my Aunt Helen and Uncle Eric. Nothing frustrates a user more than a query parser that doesn't speak the same language.
Unfortunately, most text search products don't provide any flexibility in the query language. This is, frankly, a frustrating and glaring lack of foresight. Many projects I've worked on have been forced to write their own query parser. Sometimes they are simple (add ~0.85 to the end of every word), and sometimes they are complex (full boolean languages with nested expressions and range searches). But almost everyone does some preprocessing of the query before submitting it to the search engine.
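To make the "simple" end of that spectrum concrete, here is a minimal sketch of a preprocessor that tacks a fuzzy suffix onto every word. The function name is made up for illustration, and whether your engine actually treats ~0.85 as a similarity threshold is an assumption you would need to check against its query syntax.

```python
def add_fuzzy_suffix(query: str, similarity: float = 0.85) -> str:
    """Append a fuzzy suffix such as ~0.85 to every whitespace-separated word.

    Hypothetical helper: the suffix is only meaningful if the target
    engine's query language supports it.
    """
    return " ".join(f"{word}~{similarity}" for word in query.split())


print(add_fuzzy_suffix("kawasaki motorcycle"))
```

Even a one-liner like this counts as a query parser of sorts: it takes what the user typed and turns it into what the engine should run.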
So what are all the things you can do with a query parser? Well, here's a list:
1. Identify What To Search For
Step one of query parsing is to identify what words get searched. This is not as simple as it might seem.
At a minimum, the query parser will need to identify what is a query term. Is F-150 two terms or just one? Is the dash important? What about F150? What about when "F-150" is enclosed in double-quotes? Does that mean something different?
Of course, all of these decisions must be made in reference to the indexer. The query parser can only search for terms that the indexer has decided to index. And so you may wish to search for +/- 5 miles, but it is likely that the characters "+/-" are not anywhere to be found in the index.
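Here is one possible way to answer those questions in code. It is only a sketch, and it assumes an indexer that lowercases everything and throws away all characters other than ASCII letters and digits, so F-150 and F150 land on the same term. Your indexer may well make different choices, and the query side has to follow them. The function names are my own.

```python
import re

# A double-quoted phrase, or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')


def normalize(text: str) -> str:
    """Mirror the assumed indexer: lowercase, keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def query_terms(query: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs, where kind is "phrase" or "term"."""
    terms = []
    for match in TOKEN_RE.finditer(query):
        phrase, word = match.groups()
        if phrase is not None:
            # Normalize each word inside the quotes, dropping empty leftovers.
            words = [w for w in map(normalize, phrase.split()) if w]
            if words:
                terms.append(("phrase", " ".join(words)))
        else:
            text = normalize(word)
            if text:  # "+/-" normalizes to nothing, so it is dropped
                terms.append(("term", text))
    return terms


print(query_terms('F-150 "F-150" +/- 5 miles'))
```

Under these assumptions, "+/-" quietly disappears from the query, which is exactly what you want if the index never stored it in the first place.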
2. Parse The Query Language Itself
Most obviously, the query parser must parse the query language itself. This includes recognizing and interpreting operators (AND, OR, +, -, NOT, etc.), grouping operators, and field restrictors.
Even Google can now execute queries like this:
music -video (singer OR songwriter) site:amazon.com
Depending on the end-user, the query language can either be very simple (just a string of tokens, perhaps with double-quoted phrases) or quite complex (full boolean syntax with nested expressions, proximity operators, and term weighting). Or it can be a frustratingly inconsistent hybrid – like Lucene.
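For the curious, a small recursive-descent parser is enough to handle a query like the Google example above. The sketch below is a toy grammar of my own, not how Google or any particular engine does it: adjacent clauses are implicitly ANDed, OR binds tighter than the implicit AND, a leading - or NOT negates, parentheses group, and field:value is a field restrictor. The tuple shapes in the output are made up for illustration.

```python
import re

# Parentheses, a leading minus, or a run of characters that is neither.
TOKEN_RE = re.compile(r"[()]|-|[^\s()-][^\s()]*")


def parse_query(query):
    tokens = TOKEN_RE.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_and():
        # Adjacent clauses are implicitly ANDed; an explicit AND is skipped.
        clauses = []
        while peek() is not None and peek() != ")":
            if peek() == "AND":
                advance()
                continue
            clauses.append(parse_or())
        return clauses[0] if len(clauses) == 1 else ("AND", clauses)

    def parse_or():
        alternatives = [parse_unary()]
        while peek() == "OR":
            advance()
            alternatives.append(parse_unary())
        return alternatives[0] if len(alternatives) == 1 else ("OR", alternatives)

    def parse_unary():
        if peek() in ("-", "NOT"):
            advance()
            return ("NOT", parse_unary())
        return parse_primary()

    def parse_primary():
        if peek() is None:
            raise ValueError("unexpected end of query")
        token = advance()
        if token == "(":
            node = parse_and()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        if token == ")":
            raise ValueError("unexpected ')'")
        field, sep, text = token.partition(":")
        if sep and field and text:
            return ("TERM", field, text)  # field restrictor, e.g. site:amazon.com
        return ("TERM", None, token)

    tree = parse_and()
    if pos < len(tokens):
        raise ValueError("unexpected ')'")
    return tree


print(parse_query("music -video (singer OR songwriter) site:amazon.com"))
```

The resulting tree still has to be translated into whatever your engine actually accepts, but once the query is in this form, every other trick in this article becomes a tree rewrite.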
3. Provide Access to Search Engine Features
Some query languages are never intended to be used by end-users. Oracle Text and FAST FQL fall into this category. Instead, these languages are intended to be used by other programmers for accessing all features provided by the search engine.
Ideally, all search features which can be processed by the engine are available somehow, but this is often not the case without customization. For example, there are many (many!) more search features available in the Lucene query engine than can be expressed by the standard Lucene query syntax. If you want to take advantage of these additional features, you must create your own query parser.
4. Search for Other Things
When you search for "chair" do you also want to find "chairs"? How about when you search for "mouse", should you also find "mice"?
This can get quite complex. For one customer, when you search for "underground mining", you also get "room and pillar mining". Similarly, a search for "blackjack" gives "sphalerite" and "mock lead" (and vice-versa). Such queries are often useful to help naïve end-users search over vertical domains with which they may be unfamiliar.
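A bare-bones version of this kind of expansion might look like the sketch below. The synonym table is hard-coded from the examples above purely for illustration (a real one would come from a domain thesaurus), the function names are hypothetical, and the quoted-phrase OR syntax of the output is an assumption about what the target engine accepts.

```python
# Terms in the same group expand to each other (the "vice-versa" case).
SYNONYM_GROUPS = [
    ["blackjack", "sphalerite", "mock lead"],
]

# One-way expansions: the key also retrieves the listed terms.
ONE_WAY = {
    "underground mining": ["room and pillar mining"],
}


def build_lookup(groups, one_way):
    lookup = {}
    for group in groups:
        for term in group:
            lookup[term] = [other for other in group if other != term]
    for term, extras in one_way.items():
        lookup.setdefault(term, []).extend(extras)
    return lookup


def expand(query, lookup):
    """Rewrite a query as an OR of itself and its known expansions."""
    key = query.strip().lower()
    variants = [key] + lookup.get(key, [])
    return " OR ".join(f'"{variant}"' for variant in variants)


LOOKUP = build_lookup(SYNONYM_GROUPS, ONE_WAY)
print(expand("blackjack", LOOKUP))
print(expand("underground mining", LOOKUP))
```

Keeping the two-way groups separate from the one-way table is a deliberate choice here: a search for "underground mining" should pull in "room and pillar mining", but it is not obvious the reverse should hold.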
And don't forget fuzzy searching. If your data is especially dirty, this can be critical. We found 14 variations of the word "Kawasaki" for a customer last week, including things like "akawasaki", "kawasawki", and "hawasaki".
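If the engine has no fuzzy operator of its own, the query parser can do the work itself by looking up near-misses in the index vocabulary and ORing them in. The sketch below uses Python's standard difflib module. The vocabulary list and the 0.8 cutoff are assumptions for illustration, not what we used for that customer.

```python
from difflib import get_close_matches

# Hypothetical slice of the index vocabulary, including dirty variants.
VOCABULARY = ["kawasaki", "akawasaki", "kawasawki", "hawasaki", "yamaha", "honda"]


def fuzzy_variants(term, vocabulary, cutoff=0.8, limit=20):
    """Return indexed terms whose similarity ratio to `term` is at least `cutoff`."""
    return get_close_matches(term.lower(), vocabulary, n=limit, cutoff=cutoff)


variants = fuzzy_variants("Kawasaki", VOCABULARY)
print(" OR ".join(variants))
```

A cutoff that is too low will start matching unrelated words, so it is worth tuning against a sample of your real data.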
5. Aid in Relevancy Ranking
In the role of translator, query parsers will often make adjustments to the query entered by the user to help retrieve documents better. For example, documents which contain search terms in the title or abstract may be considered to be more relevant.
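As a sketch of how a parser might do that, the function below expands each word into a fielded, weighted clause using Lucene-style field:term^boost syntax. The field names and weights are invented for illustration. Your index schema will differ, and an engine with a different boost syntax would need a different output format.

```python
# Hypothetical fields and weights: title hits count most, body hits least.
FIELD_BOOSTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def boost_query(query, boosts=FIELD_BOOSTS):
    """Require every word, but score title and abstract matches higher."""
    clauses = []
    for term in query.split():
        per_field = " OR ".join(
            f"{field}:{term}^{weight}" for field, weight in boosts.items()
        )
        clauses.append(f"({per_field})")
    return " AND ".join(clauses)


print(boost_query("query parser"))
```

The user still types two plain words. The weighting happens entirely behind the scenes.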
There are many relevancy ranking tricks and techniques that a query parser can employ, as long as the underlying engine is powerful enough. For example, query parsers can boost documents which contain all of the terms close together (proximity weighting), or boost documents from friendly web sites while demoting documents from unfriendly sites.
If you also have control over how the documents are indexed, then the sky is the limit. You can do sophisticated things such as boosting home pages, boosting documents recently published, boosting documents which contain key domain-specific terminology, or exercising fine-grained control over what types of queries are strong hits for each individual document.
For example, on GPO FDsys, we made certain that document citations are the strongest type of search. If you happen to know the number for the health care bill, you can enter the query: "h.r. 3962, 111th congress", and you'll find it right away.
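To give a flavor of how a parser can spot a citation like that, here is a rough sketch. It is not the FDsys implementation: the regular expression only covers House and Senate bill citations in this one format, and the field names in the result are hypothetical.

```python
import re

# Matches e.g. "h.r. 3962, 111th congress" or "S. 1 110th Congress".
CITATION_RE = re.compile(
    r"^\s*(?P<type>h\.?\s*r\.?|s\.?)\s*(?P<number>\d+)"
    r"(?:\s*,\s*|\s+)(?P<congress>\d+)(?:st|nd|rd|th)\s+congress\s*$",
    re.IGNORECASE,
)


def parse_citation(query):
    """Return structured fields for a bill citation, or None for ordinary queries."""
    match = CITATION_RE.match(query)
    if match is None:
        return None
    return {
        "bill_type": re.sub(r"[^a-z]", "", match["type"].lower()),
        "bill_number": int(match["number"]),
        "congress": int(match["congress"]),
    }


print(parse_citation("h.r. 3962, 111th congress"))
print(parse_citation("health care reform"))
```

When the parser gets a match, it can issue an exact fielded lookup and rank that hit above everything else. When it does not, the query falls through to ordinary full-text search.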
I hope you've enjoyed this little tour of query parsers and what they can do. Personally, I think they're fascinating – like bottled black magic. Creating your own query parser is not for the faint of heart, but if you can do it, you can make your queries dance.