Relevancy Ranking Course: Part 3
By Paul Nelson, Chief Architect at Search Technologies
Every time I teach a course it’s exactly like this. I talk and talk and then wake up at the end of the semester only to discover that I’ve only covered half of the material. Time management expert, I am not.
But thankfully, you do not read this blog to hear about my time management techniques (or lack thereof). You read this blog to learn about text search relevancy ranking. So let’s get to it.
All of the relevancy statistics in this part have to do with term “occurrence” information. In other words, where do terms occur in the document (word position), within the document structure (format boosting), and within the sentence (syntactic boosting)?
Some term occurrence information is captured in other ways. For example, Field Weighting (part 2) understands when terms occur in certain fields (e.g. the title field), and Term Frequency (TF) is a count of the number of occurrences of a term in the document.
The following statistics go further in this dimension, and explore statistics which depend on the exact positions of the terms within a document, and how they are used.
Proximity Ranking

After field weighting and completeness (see part 2), proximity ranking is perhaps the most effective relevancy ranking parameter ever invented. Unfortunately, it is also one of the most difficult to compute.
Proximity ranking ranks documents where the query words are found close together (i.e. in close proximity) higher than documents where the terms occur far apart from each other. This is a common-sense relevancy ranking statistic for any query which contains multiple words; it does not help for single-word queries.
For example, if my query is |database development|, the document will be much more relevant if these words are close together than if they are paragraphs or pages apart.
Unfortunately, true proximity ranking is expensive to implement and requires more machine resources (disk, RAM, and CPU). This is because proximity ranking requires knowing the positions of the words within the document, not merely their presence or absence. Storing and fetching these positions makes the indexes bigger, indexing slower, and queries slower.
And so proximity ranking often gets short shrift from engine developers, even though it's such a powerful statistic. Some engines will only boost documents which contain the query as an exact phrase, or which contain all of the query words within some window. For example, an engine might merely add 25% to the score if all of the query words occur within a 25-word window of each other in the document.
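As a minimal sketch of that simple all-words-in-a-window approach (the function name, the 25-word window, and the 25% boost are illustrative defaults here, not any particular engine's implementation):

```python
def window_boost(positions_by_term, window=25, boost=0.25):
    """Return a score multiplier: 1 + boost if every query term
    appears at least once inside a `window`-word span, else 1.

    `positions_by_term` maps each query term to the sorted list of
    word positions where it occurs in the document (hypothetical
    output of a positional index lookup)."""
    # Flatten all (position, term) pairs, then sweep a window
    # starting at each occurrence.
    events = sorted(
        (pos, term)
        for term, positions in positions_by_term.items()
        for pos in positions
    )
    n_terms = len(positions_by_term)
    for i, (start, _) in enumerate(events):
        seen = set()
        for pos, term in events[i:]:
            if pos - start >= window:
                break
            seen.add(term)
        if len(seen) == n_terms:
            return 1.0 + boost
    return 1.0
```

For |database development| with occurrences at positions {3, 40} and {5, 90}, positions 3 and 5 fall inside one 25-word window, so the document's score is multiplied by 1.25.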
The best algorithms, however, will use a gradated window for proximity weighting: a weighting curve that peaks over the center word and tapers off gradually with distance on either side.
This window is placed on top of every query word found in a document, and then an average of the neighboring query words is computed, each one weighted by the window based on its distance from the center point. This is the proximity score for the center word. The proximity score for the document is then the maximum center-word proximity score.
The advantage of this window is that it provides a gradated proximity boost. Documents with the words that are adjacent to each other will receive the highest boost. As the words drift further and further apart, the boost will gradually decrease and the document will gradually become less relevant.
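Here is a minimal sketch of the gradated-window idea, assuming we already have each query term's word positions from a positional index. The linear taper and the 10-word half-width are illustrative choices, not a specific engine's curve:

```python
def proximity_score(positions_by_term, half_width=10):
    """Gradated proximity: for each occurrence of a query term (the
    "center"), average the window weights of the nearest occurrences
    of the *other* query terms, where the weight falls linearly from
    1.0 (adjacent) to 0.0 (half_width words away). The document's
    proximity score is the best center-word score."""
    def weight(distance):
        return max(0.0, 1.0 - distance / half_width)

    best = 0.0
    terms = list(positions_by_term)
    for center_term in terms:
        others = [t for t in terms if t != center_term]
        if not others:
            return 0.0  # single-term query: proximity does not apply
        for center in positions_by_term[center_term]:
            # Weight each other term by its closest occurrence
            # to the center word.
            weights = [
                weight(min(abs(p - center) for p in positions_by_term[t]))
                for t in others
            ]
            best = max(best, sum(weights) / len(weights))
    return best
```

With the two query words one position apart this scores 0.9; fifty words apart it scores 0.0, so the boost degrades gradually rather than switching off at a hard window edge.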
Term Position Boosting

Documents can also be ranked higher if they contain query terms earlier in the document. The theory here is that writers will put the most important information first (for example, in a summary, abstract, or introduction), and less important (or germane) information will be found later in the document.
While this might seem to make sense, our own statistical analysis has shown that term position does more harm than good. Even for relatively structured, time-based text files – like a person’s résumé – term position doesn’t appear to help much.
The only case in which I’ve successfully used term position is searching over well-known titles, such as the names of movies or TV shows. For example, if searching for “star”, the user will generally prefer matches on titles such as “Star Wars” and “Star Trek” over titles like “The Raccoons and the Lost Star”. In narrow situations such as this, term position can be useful, but in most other cases, field weighting is more than good enough.
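For that narrow title-search case, a hypothetical position boost might look like the following sketch, where the first-occurrence position of a query word drives a multiplier that decays toward 1.0 (the `strength` and `decay` constants are invented for illustration):

```python
def position_boost(first_position, strength=0.5, decay=5.0):
    """Hypothetical term-position boost: documents whose first
    query-term match occurs earlier get a larger multiplier,
    decaying smoothly toward 1.0 for late matches."""
    return 1.0 + strength * decay / (decay + first_position)

# "Star Wars": "star" is the first title word (position 0)
# "The Raccoons and the Lost Star": "star" is the sixth word (position 5)
```

A match at position 0 gets a 1.5x multiplier, while a match at position 5 gets only 1.25x, so "Star Wars" outranks "The Raccoons and the Lost Star" on the query "star".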
Format-Based Term Boosting
Format-based term boosting will boost documents if they contain query words that are emphasized in the document based on the presentation formatting.
For example, one might consider terms in <b>bold</b> to be more important than other terms in the document. Documents which contain query words that are emphasized in these ways (bold, larger font, marked as a section header, etc.) might then be ranked higher than documents where the query words are only found in plain (unadorned) text.
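As a toy sketch of the idea, using Python's built-in HTML parser (the per-tag weights are invented for illustration, and this deliberately ignores the CSS and hidden-text complications discussed below):

```python
from html.parser import HTMLParser

# Hypothetical per-tag weights: terms inside these tags count extra.
TAG_WEIGHTS = {"b": 2.0, "strong": 2.0, "h1": 3.0, "h2": 2.5}

class FormatWeightExtractor(HTMLParser):
    """Collect a weight per term: 1.0 for plain text, higher when the
    term appears inside an emphasized tag (nesting takes the max)."""
    def __init__(self):
        super().__init__()
        self.stack = []          # weights of currently open tags
        self.term_weights = {}   # term -> max weight seen

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(TAG_WEIGHTS[tag])

    def handle_endtag(self, tag):
        if tag in TAG_WEIGHTS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        weight = max(self.stack, default=1.0)
        for term in data.lower().split():
            self.term_weights[term] = max(
                self.term_weights.get(term, 0.0), weight)
```

Feeding this `<h1>Database development</h1><p>general notes</p>` gives "database" a weight of 3.0 and "notes" a weight of 1.0; those weights could then be folded into the term-frequency statistics at index time.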
The problem, of course, is how to reliably extract document formatting. To do this reliably, one must be able to parse the document tagging. Extracting formatting even from ordinary HTML text is quite complex since one must understand nested tags and cascading CSS stylesheets. And then there are odd situations to consider, such as big, bold terms within hidden <div> tags, or words which are the same color as the background (a common spammer technique).
If the content is more complex than HTML, such as PDF or Microsoft Office documents, then the difficulties are compounded. Just extracting text from these binary formats is hard enough. Reliably extracting formatting as well is even more difficult and less reliable. Worse, if you extract formatting from one type of document and not another, then you may accidentally bias the statistics in one direction or another – actually reducing your accuracy.
For all of these reasons, to my knowledge no search engine today uses content formatting to affect relevancy scoring. The possible exception might be the Google Search Appliance, which has a more HTML-focused pedigree than most other commercial engines. I don’t personally know that the GSA uses format-based boosting, but I suspect that it might.
Syntactic Boosting

Syntactic boosting asks how words are used within the sentence. Documents which contain query words used as the subject of the sentence, for example, might be boosted over documents where the words are used in modifying clauses.
For example, a query on “star wars” might prefer document A over document B:
Document A: “Star Wars is my favorite movie.”
Document B: “Science fiction movies, such as Star Wars, can be more than just escapist fantasy.”
Syntactic boosting is enormously difficult and of limited usefulness.
First off, syntactic parsing is itself a difficult problem over large bodies of unconstrained natural language text, and may only have a 65% accuracy rate. Second, there’s no hard evidence that words in the subject of the sentence are, in fact, more useful than words elsewhere (consider the passive voice). Position within the sentence is likely dwarfed by other, more useful statistics, such as simple proximity and term frequency (TF).
To the best of my knowledge, syntactic boosting has only been implemented in academic, proof-of-concept search engines.
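As a toy illustration of the mechanic, assuming a dependency parser has already labeled each token (the labels below are hardcoded stand-ins for real parser output, not an actual parse):

```python
# Hypothetical dependency labels, as a syntactic parser might emit them
# ("nsubj" marks the subject of the sentence).
DOC_A = [("star", "nsubj"), ("wars", "nsubj"), ("is", "cop"),
         ("my", "poss"), ("favorite", "amod"), ("movie", "root")]
DOC_B = [("science", "compound"), ("fiction", "compound"),
         ("movies", "nsubj"), ("such", "case"), ("as", "case"),
         ("star", "nmod"), ("wars", "nmod"), ("can", "aux"),
         ("be", "root"), ("escapist", "amod"), ("fantasy", "attr")]

def syntactic_boost(parsed_doc, query_terms, subject_weight=2.0):
    """Sum a per-occurrence weight for each query term:
    subject_weight when the term is labeled as the sentence
    subject, else 1.0."""
    return sum(
        subject_weight if dep == "nsubj" else 1.0
        for token, dep in parsed_doc
        if token in query_terms
    )
```

With "star" and "wars" labeled as the subject in document A but as a modifier in document B, document A scores higher, which is the preference described above.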
End of Part 3
Thanks for sticking with me through this series of blogs on relevancy ranking. As I mentioned, I’m not that great at time management, but I’m pretty sure that there will be only one more part.
Or maybe two.