Relevancy Ranking Course: Part 2
By Paul Nelson, Chief Architect at Search Technologies
This article is Part 2 of a four-part series. Read Part 1 here.
Welcome to Part 2 of Relevancy 301, the graduate-level course. Man, I had no idea that there would be so many nerds at the Phi Beta Kappa mixer. It was like Comic-Con without the costumes. The punch was killer, though.
Anyway, I know that attending an 8am lecture with a hangover is like the 11th circle of hell or something, but I see you have a coffee and a doughnut from Dunkin', so here we go.
An especially powerful relevancy technique is to rank documents higher if query terms are found in “high value” fields of the document. The most common example of this is to rank documents higher if query terms are found in the title of the document as opposed to someplace in the body.
Some engines (e.g. the Microsoft FAST engine) depend so heavily on field weighting that they have special structures in the index for field weighting, called “composite fields.”
Other engines will implement field weighting with term expansion and boosting. For example, suppose that the user’s query is |george washington|. To boost hits which occur in the title over those which occur in the body, the query is rewritten as follows:
(title:george^1.0 OR body:george^0.8) AND
(title:washington^1.0 OR body:washington^0.8)
Note that this query rewrite rule guarantees that a document will contain at least one occurrence of “george” as well as at least one occurrence of “washington”, ensuring 100% completeness (see below). Documents where these terms occur in the title will be weighted higher than documents which contain these terms only in the body.
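As a minimal sketch of the rewrite rule above, the following Python builds the expanded query string from the user's terms. The function name `rewrite_query` and the `FIELD_BOOSTS` table are hypothetical, chosen for illustration; the boost values are simply the ones from the example, and a real engine would apply this rule inside its own query parser.

```python
def rewrite_query(query, field_boosts):
    """Expand each query term across the weighted fields.

    Terms are AND-ed together so every term must appear somewhere in the
    document; fields are OR-ed so a hit in any listed field counts.
    """
    clauses = []
    for term in query.lower().split():
        alternatives = " OR ".join(
            f"{field}:{term}^{boost:.1f}" for field, boost in field_boosts.items()
        )
        clauses.append(f"({alternatives})")
    return " AND ".join(clauses)


# Hypothetical boost table: title hits count more than body hits.
FIELD_BOOSTS = {"title": 1.0, "body": 0.8}

print(rewrite_query("george washington", FIELD_BOOSTS))
# (title:george^1.0 OR body:george^0.8) AND (title:washington^1.0 OR body:washington^0.8)
```

Adding another high-value field (an abstract or a keywords field, say) is then just one more entry in the boost table.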
More than any other relevancy technique, field weighting has a direct link to user satisfaction, as long as users can actually see the fields in the search results. Remember that users want to know why a document is returned. If they can see evidence of the relevancy of the document in the search results (because their search terms are displayed in the search results, for example), then they will get an intuitive understanding as to what’s going on and will start to trust the engine more and more.
Such preferences are so strong that search engines which fail this simple test (“why aren’t documents with more hits in the title higher in the search results?”) will be thought of as broken by most users.
Link text is the text which other people use in their HTML <a href=""> tags when they link to your page. Links from other web pages to your document are called inbound links. The theory here is that what others say about your page is generally more reliable (since it comes from a third party) than what you say about your own page.
Link text is weighted very heavily by the Google search engine, so much so that many documents returned by Google will not actually contain the terms you searched on. The terms will only occur in HTML links from outside pages.
Link text is mentioned here because it is really another type of fielded data, and it is handled much the same way as “Field Weighting” as discussed in the previous section. The basic algorithm goes like this:
1. Download billions of web pages and extract all HTML links.
2. For every page, accumulate the link text from all inbound links into a "link text field."
3. Index the link text field with the document. Use it for Field Weighted search, as described above.
Easy, right? Of course, steps 1 and 2 are both very difficult, and require massive processing engines to handle the enormous amounts of data. A “Big Data” batch-processing, MapReduce, Hadoop-style engine is recommended.
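To make the first two steps concrete, here is a toy, in-memory sketch using Python's standard `html.parser` module. It assumes the pages have already been downloaded and that every link is an absolute, normalized URL; the names `LinkExtractor` and `build_link_text_fields` are hypothetical. At web scale the same grouping-by-target logic would run as a distributed batch job rather than as a loop over a dictionary.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            if text:
                self.links.append((self._href, text))
            self._href = None


def build_link_text_fields(pages):
    """pages maps url -> html; returns target url -> list of inbound link texts."""
    fields = defaultdict(list)
    for source_url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for target_url, text in parser.links:
            if target_url != source_url:  # ignore self-links
                fields[target_url].append(text)
    return dict(fields)
```

The list stored for each target URL would then be indexed as that document's link text field and given its own boost, exactly like the title in the field weighting example.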
Unfortunately, Link Text is of limited usefulness outside the World Wide Web. Most corporate intranets, for example, have very little inter-linking (they are mostly Word and PowerPoint documents), and so there is insufficient link text to merit the computational expense. Other types of vertical-domain document sets (directories, classified ads, eCommerce engines, etc.) again have little or no inter-linking and so link text field weighting is pretty much useless for these applications.
Completeness measures how many of the query terms entered by the user actually exist in the document. It is typically represented as a percentage. For example, 80% would be used to indicate that 4 of the user’s 5 query terms exist in the document.
More sophisticated versions of completeness will compute percentages based on term weight. For example, suppose I have two query terms, “A” with a weight of 0.5 and “B” with a weight of 1.5. In this example, the completeness for a document with only term “A” will be 0.5 / (0.5 + 1.5) = 0.5 / 2 = 25%.
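The weighted calculation above can be sketched in a few lines of Python. The function name `completeness` and the dictionary-of-weights input are assumptions made for illustration; giving every term a weight of 1.0 reduces it to the simple "4 of 5 terms" version.

```python
def completeness(query_weights, document_terms):
    """Fraction of the total query term weight that is found in the document.

    query_weights: dict mapping each query term to its weight.
    document_terms: set of terms that occur in the document.
    """
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    matched = sum(
        weight for term, weight in query_weights.items() if term in document_terms
    )
    return matched / total


# The weighted example from the text: only term "A" is present.
print(completeness({"A": 0.5, "B": 1.5}, {"A"}))  # 0.25
```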
Completeness is one of the most underappreciated relevancy statistics, so much so that it is rarely if ever explicitly discussed, perhaps because it hasn’t been officially named (“completeness” is my own moniker for this statistic, which I started using in the early 1990s).
What is certain is that completeness is the most important statistic to the end user. Users expect returned documents to contain all of the words from their query, and if they don't, the user will think that the search engine is broken. This desire is so strong that most professional engines only return documents which have 100% completeness.
But not only do users demand 100% completeness, they also want the search engine to be obvious about it. This is accomplished in modern engines with dynamic teasers, which show snippets (lines of text) from the document in the search results with the user’s query terms highlighted.
Completeness and the Zero Percentage
One of the consequences of a strict adherence to 100% completeness is that searches will more frequently return zero results. While zero results is actually important information to most users, it can be looked upon as a problem, especially when the percentage of queries which return zero results (also called the “zero percentage”) is too high, say more than 30% of all queries. This can be a particularly acute concern for eCommerce sites, where revenue is tied to getting the product in front of the customer. After all, if they don’t see the product, they can’t purchase it.
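Measuring the zero percentage is straightforward once you have a query log. The sketch below assumes a hypothetical log format, a list of (query text, result count) pairs; your engine's logs will look different, and the helper names are made up for illustration.

```python
from collections import Counter


def zero_percentage(query_log):
    """Fraction of logged queries that returned no results."""
    counts = [result_count for _, result_count in query_log]
    if not counts:
        return 0.0
    return sum(1 for count in counts if count == 0) / len(counts)


def top_zero_result_queries(query_log, n=10):
    """Most frequent zero-result queries, as (query, occurrences) pairs."""
    zero = Counter(
        query.strip().lower() for query, result_count in query_log if result_count == 0
    )
    return zero.most_common(n)


# Hypothetical log entries: (query text, number of results returned).
log = [
    ("george washington", 12),
    ("capillary action", 0),
    ("van gough", 0),
    ("Van Gough", 0),
    ("doughnut", 3),
]
print(zero_percentage(log))             # 0.6
print(top_zero_result_queries(log, 1))  # [('van gough', 2)]
```

The list of most frequent zero-result queries is usually the more useful output, since it shows which failures are worth fixing first.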
Naïve attempts to fix this problem will often resort to reducing completeness, so that documents with completeness percentages of less than 100% are returned. A simple ‘fix’, which I’ve actually seen in several implementations, is to replace the “default AND” operator in queries with “default OR”. While this ‘fixes’ the zero percentage problem, it does so at the expense of offending the user.
The problem with “default OR” is that queries are often made up of one common term and one rare term. An example might be |capillary action|, where the rare term, “capillary”, is most likely the reason why no documents are returned. If instead the engine returns all documents with just |action|, the results are useless and a waste of the user’s time, as the user must figure out exactly why the results are terrible.
Better solutions to reduce the zero percentage rate without sacrificing completeness will require modifying the query. For example, acronyms can be expanded (“SYTYSTAFG” to “So You Think You’re Smarter than A Fifth Grader”), common variations can be substituted (“William Clinton” for “Bill Clinton”), spell correction can correct misspellings (many zero-result queries are the result of poor spelling), words can be expanded to synonyms (“house” for “domicile”), alternative representations can be included (“CO2” for “carbon dioxide”) or even entirely different types of searches can be executed (switching to pattern search, for example, which can be especially useful if the indexed document data contains many misspelled or unusual words). Choosing which techniques to apply to achieve the most gain with the least amount of expense requires careful analysis of query logs.
In all these cases, one must be careful to not lose the user. If, at any time, the search engine returns something that the user cannot easily understand, the connection with the user will be lost and they will stop using the engine. Google, for example, is careful to tell the user “Searching for ‘van gogh’ instead of ‘van gough’, click here to search for ‘van gough’ instead” and similar user interface clues to ensure that the user is always kept informed as to what is going on every step of the way.
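Here is a small sketch of that idea: keep 100% completeness, modify the query only when the strict search comes back empty, and tell the user what was done. Everything in it is an assumption made for illustration, including the tiny in-memory index (document id mapped to a set of words), the single-word synonym table, and the use of `difflib.get_close_matches` from the Python standard library as a stand-in for a real spell corrector.

```python
import difflib


def search_all_terms(index, terms):
    """Ids of documents containing every term (100% completeness)."""
    return sorted(
        doc_id for doc_id, words in index.items() if all(t in words for t in terms)
    )


def search_with_fallback(index, query, synonyms):
    """Run a strict AND search; on zero results, retry with a modified query.

    Returns (hits, notice). The notice is None unless the query was changed.
    """
    terms = query.lower().split()
    hits = search_all_terms(index, terms)
    if hits:
        return hits, None

    vocabulary = {word for words in index.values() for word in words}
    candidates = sorted(vocabulary)
    rewritten = []
    for term in terms:
        if term in vocabulary:
            rewritten.append(term)            # term is fine as typed
        elif term in synonyms:
            rewritten.append(synonyms[term])  # substitute a known synonym
        else:
            # Stand-in spell correction: closest indexed word, if close enough.
            close = difflib.get_close_matches(term, candidates, n=1, cutoff=0.8)
            rewritten.append(close[0] if close else term)

    if rewritten == terms:
        return [], None
    hits = search_all_terms(index, rewritten)
    if not hits:
        return [], None
    notice = f"Searching for '{' '.join(rewritten)}' instead of '{query}'"
    return hits, notice


# Hypothetical two-document index.
index = {
    1: {"van", "gogh", "sunflowers"},
    2: {"house", "prices", "rising"},
}
print(search_with_fallback(index, "van gough", {"domicile": "house"}))
# ([1], "Searching for 'van gogh' instead of 'van gough'")
```

Note that the rewritten query is still run with all terms required, so the results shown always contain every term the notice says was searched for.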
[As an aside, I find that the user’s search experience is a lot like the experience that one gets when going to the theater. If things are too confusing or too stupid (trite, smarmy, old fashioned, whatever) during a play, you lose the audience. They just switch off and leave at intermission.
Search engines are exactly the same. Like attending a play, using a search engine should not require a user's manual. The engine itself must lead the user through the story in such a way that the user is kept informed and engaged as to what’s going on at all times. This can only happen if the engine, the data, the index structure, and the end-user interface are all working together in perfect synchronicity. Again, this is like the theater, where the director, the playwright, the cast, the set designer, the costume designer, the lighting designer etc. must all be working together to create a true work of art.]
Completeness Should Impact Other Statistics
Finally, I need to point out that completeness can and should come into play with other relevancy statistics. For example, when computing Term Frequency (see Part 1), documents that contain all query terms in roughly equal numbers will likely be better than documents that contain a lopsided representation of the terms. Similarly, when computing Field Weighting (see above), documents which contain all of the query terms in the title should be stronger than documents which contain only a few of the terms. One more example involves Proximity (discussed in Part 3): a dense cluster of terms in the document will be of less interest if only some of the query terms are represented in the cluster.
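As one possible illustration of the field weighting case, the sketch below scales a field's boost by the completeness of the query within that field. The formula (completeness raised to an exponent) is purely an assumption for demonstration, not a formula from any particular engine, and the function name is hypothetical.

```python
def completeness_weighted_field_score(query_terms, field_terms, boost, exponent=2.0):
    """Field boost scaled by how completely the field covers the query.

    An exponent above 1 rewards fields that contain all of the query terms
    disproportionately more than fields that contain only some of them.
    """
    unique_terms = set(query_terms)
    if not unique_terms:
        return 0.0
    present = sum(1 for term in unique_terms if term in field_terms)
    field_completeness = present / len(unique_terms)
    return boost * field_completeness ** exponent


query = ["george", "washington"]
# Both query terms in the title: full boost.
print(completeness_weighted_field_score(query, {"george", "washington", "biography"}, 1.0))  # 1.0
# Only one of the two terms in the title: a quarter of the boost, not half.
print(completeness_weighted_field_score(query, {"george", "orwell"}, 1.0))  # 0.25
```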
End of Part 2
Well, you made it through your 8am lecture without falling asleep. Bravo for you. Next time we’ll talk about proximity and other statistics which involve knowing the positions of the query terms within the document.
And yes, it will be on the final.