Back to top

Relevancy Ranking Course – A Four-Part Series

The Graduate Level Course

This article is Part 1 of a series.

Read the rest of the series:

So, I was going to write an article entitled “Relevancy 101”, but that seemed too shallow for what has become a major area of academic research. And so here we are with a Graduate-Level Course. Grab your book-bag, some Cheetos and a Mountain Dew, and let’s kick back and talk search engine relevancy. 

I have blogged about relevancy before (see “What does ‘relevant’ mean?)”, but that was a more philosophical discussion of relevancy. The purpose of this blog is to go in-depth into the different types of relevancy, how they’re computed, and what they’re good for. I’ll do my best to avoid math, but no guarantees.

Term Frequency (TF) 
By far the most basic relevancy ranking criteria, and the very first one to be implemented by search engines as far back as the 1970’s, is “Term Frequency”. This is simply the count of the number of occurrences of a query term in the document. Documents that contain more mentions of the query term are considered to be more relevant than documents with fewer mentions. Multiplied by IDF (see below) this provides the standard “TF-IDF” statistic (see Wikipedia entries for non-normalized TF-IDF, and the document-size normalized version known as Okapi BM25). 

The word “frequency” is really a misnomer. Most algorithms use a simple occurrence count, and so the “frequency” is simply “2” if there are two occurrences, “3” if there are three occurrences, etc. 

This reveals a common problem with Term Frequency. It tends to prefer larger documents over smaller documents, simply because large documents are more likely to contain more terms and therefore more mentions of the query word. 

Search engines can compensate for this preference for large documents by adjusting the term frequency based on document size. Unfortunately, this can then cause problems with very small documents (directories, classified ads, micro-blogs, etc.), where a single term match in a 5 word document becomes much more important than a single term match in a 10-word document, although the actual linguistic difference is really negligible. 

A second problem with Term Frequency is that small differences can have a profound effect on the final score. A document that contains two occurrences of the query term will have a score that is twice as large as a document that contains only a single occurrence. Such large differences in ranking values tend not to be supported by the differences in language usage, which are often incidental.

Inverse Document Frequency (IDF) 
Information theory (aka Shannon's Theory) states that the amount of information carried in a data token is inversely proportional to its frequency. High frequency tokens contain little information. Low frequency tokens contain a lot of information. These theories are the basis of the Inverse Database Frequency (IDF) statistic, which says that words are more important if they occur in fewer documents, and that documents that contain more of these sorts of words will be more relevant to the user. 

Early on, search engine designers worried about users searching on very frequent terms which ended up not being that useful. Words such as “about”, “very”, “contain”, etc. are so often used in standard English that they can skew search results in the wrong direction when they are included in the query. Results are often more targeted when these words are removed. 

The usual formula for IDF for term T is:

IDF(T) = log(N/dbCount[T]) 

where “N” is the count of all documents in the database (or some other very large constant – intended to scale up the input to the log() function to prevent negative IDF values) and “dbCount[T]” is the count of all documents that contain the term T. 

Note that IDF is dependent on the query term (T) and the database as a whole. In particular, it does not vary from document to document. Therefore, IDF will have no effect on 1-word queries. Putting IDF with TF, you get the standard “TF-IDF” formula:

 TF-IDF(T) = TF * log(N/dbCount[T])

Personally, I’m not a big fan of the Inverse Document Frequency statistic. It is telling that the Wikipedia entry on Information Theory contains the following statement:

Note that these concerns have nothing to do with the importance of messages. For example, a platitude such as "Thank you; come again" takes about as long to say or write as the urgent plea, "Call an ambulance!" while the latter may be more important and more meaningful in many contexts. Information theory, however, does not consider message importance or meaning, as these are matters of the quality of data rather than the quantity and readability of data, the latter of which is determined solely by probabilities.

This statement concisely points out the problems with an over-reliance on the IDF statistic. Specifically:

    1. IDF does not consider statistical sample size. The statistics for words that occur in a small number of documents are, because of reduced sample sizes, very unreliable. Another way to state this is that words that occur very infrequently have very low confidence values. This issue of diminishing confidence is not incorporated into IDF.
    2. Along the same lines, misspelled words may occur very infrequently, and are therefore rated highly by IDF, but will most often not be that useful.
    3. Combinations of very common words may themselves be very important. An example of this is the rock band “The Who”, made up of two very frequently used words, but which is very specific when put together.
    4. Small differences in word count, while significant to the IDF score, may not be that different to the user. For example, a word that occurs in 10 documents is not likely to be that much more important than a word that occurs in 100 documents, out of a database of a million documents.
    5. Words entered by the user are the best judgment of what’s important. Algorithms that over-weight term quality will inevitably cause problems for a large percentage of queries, undervaluing words that are truly important to the user.
    6. If the search engine is fast enough, and if dynamic teasers are turned on, users who chose poor query terms will quickly realize that these words should be removed, limiting the effectiveness of IDF over the entire search experience.

The last comment in the list above is an important one. IDF is often thought of as the “cure” for supposedly natural-language type queries such as “please find me information on baking a turkey.” But nobody actually searches like that. Instead, people will enter queries such as “bake turkey” – mostly because users are lazy and don’t want to enter all that text into a search engine. With good queries, where all words are important – or at least important in the user’s eyes – too much IDF can produce poor results. 

Finally, the fact that IDF is dependent on the database as a whole poses enormous difficulties for search engine architectures, especially when searching over very large databases. Typically, indexes will be able to easily produce “dbCount[T]” and “N” from the index structures, but only on a single index stored on a single machine. With large indexes distributed across multiple indexes and multiple machines, IDF will necessarily vary from index to index and machine to machine, causing inconsistencies in relevancy scoring. Stated more succinctly, the relevancy score for document X can vary depending on which index contains document X – a rather non-intuitive side-effect of IDF. 

Further, these variances may be significant if there is an inherent bias in the domain or type of documents in different distributed indexes. For example, an index of history papers will contain a very different statistical word usage profile compared to a dump from the corporate E-mail server. Systems with such wide variations in word usage make it virtually impossible to compare relevancy scores. A second example of this problem is with time-based index archives, since language drift over time quickly becomes statistically significant. 

Often these effects can be mitigated by randomly distributing documents across machines and indexes. Unfortunately, however, such random distributions often get in the way of index maintenance for large, time-oriented databases. Some search engines will go so far as to implement a two-pass query; the first pass collects IDF(T) numbers for all query terms across all participating indexes, and the second pass performs the actual search. Naturally, this slows down and complicates the search process. 

Other Term Weighting Techniques 
The purpose of IDF as described above is to weight terms in the query based on their “information value”. In other words, very generic words that are used a lot will not affect relevancy as much as infrequent, very specific terms. But IDF is not the only method for determining term weights. There are many other methods for term weighting. Here are some of the more interesting techniques:

    1. Dictionary Based Methods – Look up each query term in a dictionary and use dictionary data to determine the term weight. For example: 
      • Weight adverbs and adjectives lower than nouns
      • Weight nouns over verbs
      • Weight articles (such as “the”, “a”, “an”, etc.) low
      • Weight terms with many dictionary meanings (i.e. highly ambiguous terms) lower than terms with fewer dictionary meanings
    2. Stop Words – Much smaller than a full dictionary, “stop words” are frequently occurring words that may simply be removed from relevancy scoring completely
    3. Term length – Words that are longer (have more characters), such as “computability” are considered to be more valuable than small words such as “is”, “go”, etc.
    4. Term Cohesiveness – By analyzing the context around each occurrence of a term, one can compute it’s “cohesiveness”, or in other words, how often the term is used within a similar context. For example, terms that are used with the same set of neighboring terms are “more cohesive” and may therefore be considered to be more useful (less ambiguous) search terms
    5. Position In Query – Terms at the start of the query may be considered to be more important than terms at the end of the query
    6. Frequency of Occurrence in Query – When queries are very large, for example when cutting-and-pasting whole paragraphs into the query box or when using an entire document as the “query” for similarity searching, then the number of occurrences of a term in the query can be used to help determine the appropriate term weight
    7. Capitalization – Some early search engines (Alta Vista was one of them) would add additional weight to capitalized terms entered by the user. It was assumed that these terms represented proper nouns that would be more important as search terms. A similar argument has been made about acronyms in ALL CAPS.

The use of any of the above techniques in commercial engines today is rare. All of them have been tried, but IDF remains the dominant term-weighting statistic. This may change if better and more accurate natural language dictionaries become available. 

It is important to note that all of these techniques are used to weight terms in the query, so that more useful terms are strengthened over less useful terms. This means that documents with more mentions of higher weighted terms will be retrieved over documents with the same number of mentions of a lower weighted term. 

Therefore, no amount of term weighting will improve results for a single word query. A single word query will retrieve the same documents in the same order no matter what weight is ultimately assigned to it. 

End of Part 1 
And that ends part one of my graduate-level course on Relevancy Ranking. Stay tuned for Part 2, which will cover more critically important relevancy factors, including Field Weighting, Completeness, and Proximity. 

Oh, and I heard there’s going to be a mixer at Phi Beta Kappa. See you there. 

--- Paul 

Relevancy Ranking Course, Part 2