
Statistical Relevancy Evaluation & Measurement

For search applications that support critical business processes, search relevancy is a key success factor. As with beauty, relevance is in the eye of the beholder.  In other words, it is somewhat subjective. 

Working with our customers, Search Technologies has developed an empirical approach to the evaluation and measurement of search relevancy.

This process gathers data and installs procedures to create a repeatable and statistically valid system for evaluating search engine relevancy. 

It is initiated by creating a relevancy database, either by mining user search query and click-through logs, or by other computer-assisted means. This database of queries and matching documents is then used to compute an overall search engine score as well as other statistics, such as:

  • Average position of the first relevant document
  • The number of relevant documents per query
  • Queries resulting in no relevant documents, etc.

This engagement provides statistically valid empirical data on search engine performance. Business benefits include:

  • Provides statistics that can be correlated with user behavior and revenue gains through A/B testing
  • Maintains and improves search engine quality across production deployments
  • Lowers the risk that search engine bugs will adversely affect relevancy
  • Provides a metric to gauge the efficacy of search engine software improvements and helps determine if resources are being spent effectively
  • Creates an objective measure of overall relevancy, eliminating the endless relevancy adjustments (one step forward, one step back) that often occur when competing interested parties all have a say in relevancy adjustment
  • Provides metrics for comparing search engines side by side, to help with search engine evaluation

There are two parts to this engagement: 

1) Assembling the relevancy database
2) Computing the scores and statistics

These processes have been adapted and refined from standard procedures used by the Text REtrieval Conferences (TREC), held by the National Institute of Standards and Technology ("NIST," a U.S. Government organization).

Part 1:  Assembling the Relevancy Database  
Relevancy scoring uses a database of queries in which, for each query, a list of "relevant" documents is provided. This information can be gathered in a number of ways, including:

Query and Click Log Analysis. The preferred method for creating the relevancy database is from query and click logs from existing system use. This method is preferred because the data represents actual usage, without bias.  However, query and click volumes must be large enough to sustain statistical scrutiny.  Tasks include:

  • Gathering and parsing logfiles
  • Identifying users (via session IDs or user IDs, or a cross reference of both)
  • Determining the best method for judging user interest in documents, e.g. click monitoring and time spent viewing clicked documents.
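
The log-analysis tasks above can be sketched in a few lines of Python. This is a minimal illustration, not the production tooling: the tab-separated log layout (session ID, query, document ID, dwell time in seconds), the function name `judgments_from_clicks`, and the thresholds `min_dwell_seconds` and `min_sessions` are all assumptions chosen for the example.

```python
import csv
import io
from collections import defaultdict


def judgments_from_clicks(log_file, min_dwell_seconds=30.0, min_sessions=2):
    """Build {query: set of relevant doc IDs} from a click log.

    Assumed (hypothetical) log layout, one click per line, tab-separated:
        session_id, query, doc_id, dwell_seconds
    A document counts as relevant to a query when at least `min_sessions`
    distinct sessions clicked it and viewed it for `min_dwell_seconds` or more.
    """
    # (query, doc_id) -> set of sessions with a sufficiently long view
    sessions = defaultdict(set)
    for row in csv.reader(log_file, delimiter="\t"):
        if len(row) != 4:
            continue  # skip malformed lines
        session_id, query, doc_id, dwell = row
        try:
            dwell_seconds = float(dwell)
        except ValueError:
            continue  # skip lines with a non-numeric dwell time
        if dwell_seconds >= min_dwell_seconds:
            # Normalize the query so "Annual Report" and "annual report" match
            sessions[(query.strip().lower(), doc_id)].add(session_id)

    relevant = defaultdict(set)
    for (query, doc_id), session_ids in sessions.items():
        if len(session_ids) >= min_sessions:
            relevant[query].add(doc_id)
    return dict(relevant)


SAMPLE_LOG = (
    "s1\tannual report\tdoc-7\t95\n"
    "s2\tAnnual Report\tdoc-7\t40\n"
    "s2\tannual report\tdoc-9\t3\n"
    "s3\tannual report\tdoc-9\t120\n"
)

if __name__ == "__main__":
    print(judgments_from_clicks(io.StringIO(SAMPLE_LOG)))
```

Requiring several independent sessions before accepting a judgment is one simple way to keep a single accidental click from entering the relevancy database; the right thresholds depend on log volume.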

Manual Relevancy Judgment.  For systems without a live production or beta history of log file data, or where insufficient log file data is available, a database of manual relevancy judgments must be created. This is done using a “relevancy gathering” interface developed at Search Technologies, which does the following:

  • Presents a query from the query database, along with a brief description of query intent
  • Allows the user to execute the query (as well as other related queries, at the user’s discretion)
  • Provides an interface to browse search results and document which hits are relevant to the query
  • Stores all relevancy judgments in a relevancy database
  • Is operated by data analysts familiar with text search systems, who make the relevancy judgments
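
One way such a relevancy database might be laid out is sketched below using SQLite from Python's standard library. The table and column names and the helper functions (`open_database`, `add_query`, `record_judgment`, `relevant_documents`) are hypothetical; the schema actually used by the relevancy gathering interface is not described here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE queries (
    query_id   INTEGER PRIMARY KEY,
    query_text TEXT NOT NULL UNIQUE,
    intent     TEXT               -- brief description of query intent
);
CREATE TABLE judgments (
    query_id INTEGER NOT NULL REFERENCES queries(query_id),
    doc_id   TEXT NOT NULL,
    relevant INTEGER NOT NULL CHECK (relevant IN (0, 1)),
    analyst  TEXT NOT NULL,
    PRIMARY KEY (query_id, doc_id, analyst)
);
"""


def open_database(path=":memory:"):
    """Create the (assumed) schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_query(conn, query_text, intent):
    """Store a test query with its intent and return its ID."""
    cur = conn.execute(
        "INSERT INTO queries (query_text, intent) VALUES (?, ?)",
        (query_text, intent),
    )
    return cur.lastrowid


def record_judgment(conn, query_id, doc_id, relevant, analyst):
    """Store one judgment; a later judgment by the same analyst replaces it."""
    conn.execute(
        "INSERT OR REPLACE INTO judgments VALUES (?, ?, ?, ?)",
        (query_id, doc_id, int(relevant), analyst),
    )


def relevant_documents(conn, query_id):
    """Return the sorted IDs of documents judged relevant for a query."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM judgments "
        "WHERE query_id = ? AND relevant = 1 ORDER BY doc_id",
        (query_id,),
    )
    return [doc_id for (doc_id,) in rows]
```

Keying judgments on query, document, and analyst keeps each analyst's opinion separate, so disagreements between analysts remain visible instead of being overwritten.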

Part 2: Engine Scoring and Statistics
Once an appropriate database has been created, the search engine can be scored. Typically overall scoring will include the position of the most relevant document in the search results. The preferred formula for scoring a query is shown below. 

Assume a set of queries (Q), for each query a set of documents (Dq), and for each document a relevancy judgment (Rqd), then:
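
A form of the scoring formula that is consistent with every property stated in this section is sketched here. It is a reconstruction under stated assumptions, not a quotation: it assumes binary judgments, with Rqd = 1 if the document at position d for query q is relevant and 0 otherwise, and a weight of K^(d-1) for position d.

```latex
\[
\text{Score} \;=\; \frac{1}{Q} \sum_{q=1}^{Q} \sum_{d=1}^{D_q} R_{qd}\, K^{\,d-1}
\]
```

Under these assumptions, a query whose only relevant document is at position 1 contributes exactly 1, and a query whose results are all relevant contributes the geometric sum (1 - K^Dq)/(1 - K), which approaches 1/(1-K) as Dq grows.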

In this formula:

  • Q = The total number of queries in the test set
  • Dq = The total number of documents returned by query q
  • K = A factor that determines how quickly the importance of search results decays, starting from hit number 1 (see below). K typically varies from 0.5 to 0.9999. Values closer to 1.0 give more weight to documents lower in the results list

The formula returns:

  • 1.0 if the first document of every query is a relevant result
  • < 1.0 if the first relevant document is farther down the results list
  • > 1.0 if there are multiple relevant documents in the results list

Typically, the formula will return relatively small numbers, such as 0.2, indicating plenty of room to improve the search results. This formula is chosen because its value is bounded: if all search results are relevant, the score approaches a maximum of 1/(1-K) as the results list grows.
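
A minimal Python sketch of this scoring scheme follows. It assumes binary judgments and a weight of K^(d-1) for the document at position d, which reproduces the properties described above but is an inferred form, not a confirmed copy of the original formula. The names `query_score`, `engine_score`, `results_by_query`, and `relevant_by_query` are hypothetical.

```python
def query_score(ranked_doc_ids, relevant_doc_ids, k=0.9):
    """Sum K**(d-1) over the 1-based positions d that hold a relevant document."""
    return sum(
        k ** index  # index is 0-based, so it already equals d - 1
        for index, doc_id in enumerate(ranked_doc_ids)
        if doc_id in relevant_doc_ids
    )


def engine_score(results_by_query, relevant_by_query, k=0.9):
    """Average the per-query scores over all queries in the test set."""
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    if not results_by_query:
        raise ValueError("the test set contains no queries")
    total = sum(
        query_score(ranked, relevant_by_query.get(query, set()), k)
        for query, ranked in results_by_query.items()
    )
    return total / len(results_by_query)


if __name__ == "__main__":
    results = {
        "annual report": ["doc-7", "doc-2", "doc-9"],
        "travel policy": ["doc-4", "doc-5", "doc-6"],
    }
    relevant = {
        "annual report": {"doc-7"},  # relevant hit at position 1
        "travel policy": {"doc-6"},  # relevant hit at position 3
    }
    print(engine_score(results, relevant, k=0.5))
```

In the sample data, the first query scores 1.0 and the second scores 0.5 squared, or 0.25, so the engine score is their average, 0.625.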

In addition to the overall engine score, descriptive statistics and other analyses will be computed including:

  • Percentage of queries with a relevant document
  • Percentage without any relevant document
  • Descriptive statistics on the first matching query term within the results (median, mean, standard deviation, and histogram)
  • Analysis of individual cases to categorize problems, including implementing fixes and then re-scoring to show the improvement that can be made
  • Categorizing queries based on type, and then computing scores per query type
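
Some of the descriptive statistics above could be computed along the following lines. This sketch uses only Python's standard library and reports on the position of the first relevant document, as in the list of statistics given earlier. The function names `first_relevant_position` and `relevancy_statistics` and their input shapes are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median, pstdev


def first_relevant_position(ranked_doc_ids, relevant_doc_ids):
    """Return the 1-based position of the first relevant hit, or None."""
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return position
    return None


def relevancy_statistics(results_by_query, relevant_by_query):
    """Summarize where the first relevant document appears across all queries."""
    positions = [
        first_relevant_position(ranked, relevant_by_query.get(query, set()))
        for query, ranked in results_by_query.items()
    ]
    found = [p for p in positions if p is not None]
    total = len(positions)
    return {
        "pct_with_relevant": 100.0 * len(found) / total if total else 0.0,
        "pct_without_relevant": 100.0 * (total - len(found)) / total if total else 0.0,
        "mean_first_position": mean(found) if found else None,
        "median_first_position": median(found) if found else None,
        # Population standard deviation over the queries that had a relevant hit
        "stdev_first_position": pstdev(found) if found else None,
        # position -> number of queries whose first relevant hit is there
        "histogram": dict(sorted(Counter(found).items())),
    }
```

Running the same function on subsets of the test set grouped by query type gives per-type figures without any change to the code.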

Typical Engagement Tasks

  1. Initial analysis of document data, search engine, and user community.
  2. Initial software installation, setup, and configuration. This includes software for query and relevancy databases, manual relevancy judgment interfaces, executing queries, and scoring engines
  3. Query gathering: From log files, or based on interviews with subject matter experts
  4. Build the relevancy judgment database using search engine logs and web interface click logs, or using other methods
  5. Configure the engine scoring software to read queries and relevancy judgments, execute queries, score the engine and gather additional statistics
  6. Produce the final scoring report
  7. Project management

Deliverables

  1. Query and relevancy judgment database
  2. Search engine relevancy scoring software and usage instructions
  3. Relevancy Performance report, including improvement recommendations

Search Technologies also provides an ongoing service to periodically monitor and assess relevance, and recommend and implement appropriate changes.

For further information about this service, contact us.