Relevancy 301 The Graduate Level Course: Part 4
This article is Part 4 of a four-part series. Please read Part 1, Part 2, and Part 3 to catch up in class.
Right now I’m feeling like one of those pathetic graduate students who linger on for years and years as “ABD” (All But Dissertation). You must know one of them. They’re always at some party or another suggesting that everyone takes off from work tomorrow to go to the beach, and you’re thinking: “Dude! Shouldn’t you be at your computer finishing your dissertation?"
So here I am, finally completing “Relevancy 301”. This will be the last part of the series.
Document quality is a compelling relevancy ranking component which serves the foundation for systems as diverse as Amazon.com and Google.com. It states that some documents are “more useful” than others and that more useful documents should show up higher in the search results.
Note that document quality is a “query independent” measure, based entirely on the characteristics of the document itself ignoring how well it matches the user’s query. Typically the document quality score is computed ahead of time when the document is first processed and then stored with along with the document in the index.
Of course, what is “more useful” is a subject of endless debate, statistical analysis, and doctoral dissertations. There are numerous factors which can be used to compute it. Here are a few of the most important:
- Source Quality – Simply assign a quality factor to a content source and then propagate that factor to all of the documents associated with the source.
This is a commonly requested improvement from both enterprise search end users and management. I often hear: “Our searches return pages and pages of customer support tickets. Can’t we just push all those down?” And the answer is ‘yes’, certainly you can. Just be careful that they aren’t pushed so far down the list that no one ever sees them.
- Document Size – In some databases, document size is a good indicator of quality. For example, in Wikipedia, the larger the document (typically) the better it is.
For example, the user could search on the word “Olympics” and get “The Olympic Games” (38 pages long) or the “Detroit Olympics” (minor-league hockey team, 1927-1936, 1 page long). Typically articles of wider interest and history will contain more sections and more discussion.
- Popularity – Popularity is the statistic tied most directly to usefulness. It says that more popular documents should be ranked higher in the results.
Systems will often use log-file analysis to measure popularity, for example by counting the number of clicks on documents over the last 12 months.
- Inbound Links – A very powerful algorithm (when available) is to use “inbound link counting” to determine a document’s popularity. With inbound link counting, documents which are most often referenced by other documents are considered to be the most useful.
Inbound link counting is found in a wide variety of different contexts, such as web search, patent search (patents must reference other patents), academic abstracts (academic articles which are often referenced by other writers are considered to be the ‘seminal’ or most useful articles), and encyclopedias (our Wikipedia search uses inbound link counting as part of the relevancy formula).
- Page Rank – Page Rank is a more statistically-rigorous version of inbound link counting which attempts to compute the probability that a user will randomly land on a page by clicking on links.
With simple inbound link counting (see above), each link counts the same. With PageRank, links from the best pages count for more than links from weak pages. Further, since the page rank is a probability, it stays entirely within 0 to 100%, making it a mathematically easier score for search engines over simple inbound link counting.
- User Ratings – In sites which have user ratings, such as Netflix or Amazon, ratings can be used to influence document quality, since, presumably, a 5-star movie will be more satisfying find to the user than a 1-star movie.
- Freshness – Freshness says that more recent documents (those with younger creation dates) are more useful than older documents. Clearly, this is true for news, and perhaps for product-support databases too. It is less useful for other systems, such as encyclopedias, archives, or ordinary corporate data.
- Editorial Decision – Many search applications enable editors to make decisions about document quality. This is most often the case for things like corporate websites or e-commerce sites where folks may want to push new initiatives or “hot products” to the top of the list.
Suppose you search on the word “horse”. Should documents which mention “ponies” be returned? What about documents which contain the word “equestrian”, “stallion”, or “the Kentucky Derby”?
These are all examples of relevancy scoring by semantic distance. Documents which mention words or concepts which are more closely related to search terms will be returned higher in the search results.
There are several problems with semantic distance relevancy ranking: First, it requires a semantic network of words, synonyms and concepts which must be maintained. One popular semantic network is WordNet, invented by George A. Miller (Aside: George, wherever you are, your WordNet has been an inspiration to me for nearly a quarter of a century. May you rest in peace.) at Princeton University.
The second problem with semantic distance relevancy ranking is that it returns documents that do not contain the words which the user entered. For example, if you search on “Horses” and the system brings back documents about the “Dangerous Ponies” (a rock band), you’ll likely think that the search engine is broken. The word “ponies” in the music context is so far removed from “horses” that the connection is lost.
However, if you search on “horses” and the search engine returns “Band of Horses” (another rock band), because there is an exact word-to-word match it is clear why the search engine returned the document. It requires less mental effort on the part of the searcher to puzzle out why the document was returned.
Third, it is often hard to tell what meaning of the word is intended by the user. The famous is example is when the user searches on “stocks”. Do you include words like “bonds” or words like “soup” or “cattle”? All of these are types of stock, but if the user has a particular meaning in mind, then returning related terms to some other meaning will certainly look wrong. One solution to this problem is to simply ask the user to choose a meaning. In some domains this is a valid approach.
Similar to semantic distance, pattern distance ranks documents higher which contain words which have more similar character patterns to the words entered by the user.
This is a critical relevancy ranking feature for some content sources, for example classified advertisements or documents created using Optical Character Recognition (OCR), both of which will have a percentage of misspelled words. Classified ads, especially, have such a small amount of content (one or two dozen words) that even a single misspelling represents a large percentage of "the document" and might make the ad un-retrievable without pattern distance relevancy ranking.
Pattern distance is often computed by N-Gram analysis. Documents which contain words which contain the largest number of common N-Grams to the user’s query words will be retrieved first.
GEOGRAPHIC DISTANCE BOOST
There is a fine line between sorting and relevancy ranking and nowhere is that line finer than with geographic distance ranking.
Most systems will simply sort by geographic distance. Unfortunately this causes strange situations where a very weak document is shown first simply because it happens to be the closest to the customer. And so many systems attempt to add a “distance boost” to documents which are closer to the user.
Unfortunately, geographic distance boosting is tricky (and perhaps mathematically impossible) to get right. Turning the geographic distance boost too high and becomes just like sorting by distance. Turn it to low and it’s just like sorting by standard relevancy.
Further, if the distance is shown in the search results, the users will look at it and say: “It’s not sorted by distance, it’s broken.” Trying to explain that distance is just a “preference boost” usually causes people’s eyes to glaze over.
Putting IT ALL TOGETHER
This reminds me of my all-time favorite Broadway song.
Okay, so now you’ve seen all of the various relevancy ranking scores. How in the world is one supposed to combine all of this into a single list of results?
I won’t lie, it is very difficult and something which is more of an art than a science (see the song) – although mathematics usually does play a significant role. The following are some suggestions and possible strategies to consider.
Divide and Conquer
My first recommendation is to divide the problem into four parts:
- Document Quality
- Query Quality
- Document to Query Matching
- External factors (distance, time, category, etc.)
These four different areas are relatively independent, and so it’s possible to tune up each one individually, and then combine them together at the end.
Be Unromantic and Methodical
Don’t stick with relevancy factors which don’t work. Many times I have seen people add in factors which “seemed like a good idea” or “that’s the way it’s always been done”.
Evaluate every element from ground-up. Start with the most basic possible system and then build up your relevancy ranking, step by step. Make sure that each item you add actually improves the results. If it doesn’t, then remove it.
Consider Providing Multiple Results
Sometimes, no matter what you do, you get bad documents. I see this all the time on web site searches, where press releases often overwhelm actually useful documents from the web site itself.
And so, in these situations, consider providing two sets of results. For example, separate the press releases into a separate category and provide them on a side-bar, or grouped together somehow. Look at how google.com groups all of the news together in a query as in the example here (search query: "Baltimore Ravens").
Grouping results and providing multiple results lists is a great way to clean up your search results and provide the user with a good overview of “what’s available”.
When dealing with multiple dimensions of sorting and relevancy, consider ring sorting.
This usually occurs when both time and relevancy are important (for example, when searching over news) or when both distance and relevancy are important (for example, when searching over a business directory).
Rather than just sorting on time (or distance), or just sorting on relevancy, consider creating “rings” or ranges of time or distance, and then sort on relevancy within these rings.
For example, instead of showing an exact number such as 4.32 miles, consider showing a more user-friendly distance such as “0-5 miles”. This represents a “ring” of distance. Then, within all documents that are 0-5 miles away, perform a sub-sort on relevancy.
Similarly with time, instead of saying “4 hours 13 minutes and 42 seconds ago”, perhaps say, simply “today”, “this week” and “this month”. These create “rings” of range. Within each range perform a sub-sort on relevancy.
Ring sorting is a relatively easy way to combine gradient sort values (such as distance and time) with relevancy while providing results which are easy to understand.
If you have access to the internals of the search engine algorithm, you might consider performing a statistical regression analysis to determine optimal weights for all of the components of relevancy.
In order for this to work, you’ll need two things:
- A “judgments database” which identifies relevant documents for a sample query set. This can be gathered from click logs or by manual evaluation.
- Detailed, separated relevancy data from the search engine. Ideally you’ll be able to get separate scores for proximity, completeness, semantic distance, IDF, TF, document quality, etc. for every document returned by every query in your sample query set.
Once you have all this information, you can use regression analysis to best predict relevancy based on the separated data. The end-result will be a formula which returns a normalized (percentage) score for every document. The analysis will further identify how well correlated each of the individual data points is to the overall score, so that low-performing statistics can be removed from the computations.
Of course, once the formula is ready, it needs to be coded directly into the search engine itself. This is where having access to the search engine source code is helpful.
Relevancy ranking for search engines is my favorite topic of discussion, because it touches on all aspects of search, data, and user behavior. I really do believe there’s a certain level of passion and artistic impulse that goes into relevancy ranking to produce search results which have real impact. It’s a balance of a large number of different factors, and always with an eye to how it will be perceived by the end-user.
I realize that most courses, even graduate-level ones like this, don’t take 16 months to complete, so I thank you for sticking with this one to the very end.