Solr Lucene Relevancy Tuning
Improved relevancy directly enhances user productivity and core business objectives
Search Technologies provides a services engagement, typically for an agreed fixed price, to improve the relevancy of search results within an existing Solr Lucene implementation.
This engagement will provide powerful relevancy ranking improvements in an existing Solr installation. This includes setting up a basic system for relevancy evaluation, based on a set of sample queries, so that improvements can be quantitatively measured.
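As an illustration of what a basic evaluation system built on sample queries might look like, the sketch below computes precision at k averaged over a query set. It is a minimal example under stated assumptions, not the evaluation tooling used in the engagement: the function names, the choice of metric, and the sample judgments are all hypothetical.

```python
from typing import Dict, List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(
    results: Dict[str, List[str]],
    judgments: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average precision@k over every query that has relevance judgments."""
    scores = [
        precision_at_k(results.get(query, []), relevant, k)
        for query, relevant in judgments.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical judgments and result lists for two sample queries.
judgments = {"solr tuning": {"d1", "d4"}, "proximity boost": {"d7"}}
baseline = {"solr tuning": ["d2", "d1", "d3"], "proximity boost": ["d5", "d6", "d7"]}
tuned = {"solr tuning": ["d1", "d4", "d3"], "proximity boost": ["d7", "d5", "d6"]}

baseline_score = mean_precision_at_k(baseline, judgments, k=2)
tuned_score = mean_precision_at_k(tuned, judgments, k=2)
```

Running the same query set before and after each tuning change gives a single number to compare, which is the sense in which improvements become quantitatively measurable.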
Additions to the default relevancy formula in Solr Lucene can dramatically improve search results and solve many of the thorniest relevancy problems. For example:
- Reducing the impact of peripheral content (sidebars, ads, tangential discussions, etc.)
- Automatically handling word phrases in a flexible manner, reducing the need to use complex query constructions to obtain good search results
This service can ensure that open source Solr-based search applications provide highly relevant results to users. Improvements in relevancy can transform the contribution that a search application makes to a business process.
EXAMPLE FEATURE DESCRIPTIONS
Every services engagement is treated differently, taking full account of the objectives of the application. The sections below illustrate two important methods of Solr relevancy improvement that are often appropriate to a customer's needs.
Parameterized Document Similarity Function
Default Solr Lucene systems use a fixed document similarity function that depends heavily on term frequency / inverse document frequency (tf-idf) statistics. These default implementations put too much weight on document size (boosting small documents) and on rare terms in relevancy calculations. Search Technologies provides parameterized versions of tf-idf that give substantially more control over the relevancy formulas. The parameterized operator has configurable parameters that set the exact amount of boost for each tf-idf ranking factor, and it provides upper and lower thresholds that reduce the effect of unreliable statistics at very fine granularity (when a term occurs in only a few documents).
Note: Solr versions 1.4.0 and 1.4.1 require a source code patch to implement the Parameterized Document Similarity Function. In releases currently in development (expected to be numbered 3.1 or later), it can be implemented via a configuration change and a drop-in library.
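To make the idea of a parameterized tf-idf concrete, the following sketch scores a single term with tunable exponents and document-frequency thresholds. It is an illustration only and not the Search Technologies operator: the parameter names, defaults, and the simplified single-term formula are assumptions. The defaults loosely mirror the classic Lucene shape (square-root term frequency, logarithmic idf, and 1/sqrt(length) normalization).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimilarityParams:
    """Hypothetical tuning knobs; names and defaults are illustrative."""
    tf_exponent: float = 0.5            # 0.5 gives a square-root term frequency
    idf_weight: float = 1.0             # scales how much term rarity matters
    length_exponent: float = 0.5        # 0.5 gives 1/sqrt(length); 0 disables it
    min_doc_freq: int = 1               # lower threshold on document frequency
    max_doc_freq: Optional[int] = None  # upper threshold on document frequency


def term_score(term_freq: int, doc_freq: int, num_docs: int,
               doc_length: int, p: SimilarityParams) -> float:
    """Score one term in one document using the parameterized factors."""
    if term_freq <= 0:
        return 0.0
    # Clamp document frequency so very rare terms, whose statistics are
    # unreliable, cannot dominate the score.
    df = max(doc_freq, p.min_doc_freq)
    if p.max_doc_freq is not None:
        df = min(df, p.max_doc_freq)
    tf = term_freq ** p.tf_exponent
    idf = max(0.0, 1.0 + p.idf_weight * math.log(num_docs / (df + 1)))
    length_norm = 1.0 / (max(doc_length, 1) ** p.length_exponent)
    return tf * idf * length_norm


# Hypothetical settings: damp the short-document boost and floor rare terms.
default_params = SimilarityParams()
tuned_params = SimilarityParams(length_exponent=0.25, min_doc_freq=50)

rare_default = term_score(1, 1, 1000, 100, default_params)
rare_tuned = term_score(1, 1, 1000, 100, tuned_params)
```

Lowering `length_exponent` reduces the advantage of small documents, while `min_doc_freq` treats any term seen in fewer than 50 documents as if it occurred in 50, which limits the influence of very rare terms.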
Gradient Proximity Boost
Default Solr Lucene systems have a very limited “hard window” proximity boost. If all terms are “within window” the document will receive a fixed boost multiplier. If any term is “out of window” no boost is applied.
The Search Technologies Gradient Proximity Boost operator instead measures the density and completeness of terms across the document. Documents in which terms are clustered close together will be boosted more than documents in which terms are widely distributed, but in a gradual way. This operator eliminates the need to tweak fixed window sizes.
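The difference between a hard window and a gradual boost can be sketched as follows. This is a simplified illustration and not the Search Technologies operator: the function names, the density measure (matched terms divided by the smallest span that covers them), and the way completeness and density are combined are all assumptions.

```python
from typing import Dict, List, Optional


def min_cover_span(positions: Dict[str, List[int]]) -> Optional[int]:
    """Token length of the smallest window holding one occurrence of every term."""
    if not positions or any(not plist for plist in positions.values()):
        return None
    events = sorted((pos, term) for term, plist in positions.items() for pos in plist)
    needed = len(positions)
    counts: Dict[str, int] = {}
    best: Optional[int] = None
    left = 0
    for pos, term in events:
        counts[term] = counts.get(term, 0) + 1
        # Shrink the window from the left while it still covers every term.
        while len(counts) == needed:
            left_pos, left_term = events[left]
            span = pos - left_pos + 1
            if best is None or span < best:
                best = span
            counts[left_term] -= 1
            if counts[left_term] == 0:
                del counts[left_term]
            left += 1
    return best


def hard_window_boost(query_terms: List[str], doc_positions: Dict[str, List[int]],
                      window: int = 10, boost: float = 2.0) -> float:
    """Fixed boost only if every query term falls inside the window."""
    matched = {t: doc_positions.get(t, []) for t in set(query_terms)}
    span = min_cover_span(matched)
    return boost if span is not None and span <= window else 1.0


def gradient_boost(query_terms: List[str], doc_positions: Dict[str, List[int]],
                   max_boost: float = 2.0) -> float:
    """Boost that scales smoothly with term completeness and density."""
    unique_terms = set(query_terms)
    matched = {t: doc_positions[t] for t in unique_terms if doc_positions.get(t)}
    if len(matched) < 2:
        return 1.0  # proximity is undefined for fewer than two matched terms
    completeness = len(matched) / len(unique_terms)
    density = len(matched) / min_cover_span(matched)  # 1.0 when terms are adjacent
    return 1.0 + (max_boost - 1.0) * completeness * density


# Hypothetical term positions in three documents for the query "solr tuning".
query = ["solr", "tuning"]
tight = {"solr": [3], "tuning": [4]}
medium = {"solr": [3], "tuning": [14]}
loose = {"solr": [3], "tuning": [22]}
```

With a window of 10, the hard-window function treats `medium` and `loose` identically (no boost), whereas the gradient version still ranks `medium` above `loose` and needs no window size to tune.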
PREREQUISITES
- A working Solr Lucene system with documents already indexed
TYPICAL ENGAGEMENT TASKS
- Current system evaluation
- Gather basic statistics on the document base (number of documents, average size, number of fields, tokens per document, tokens per field, etc.)
- Gather basic statistics on the query set (number of tokens per query, types of operators used, etc.)
- Gather sample queries for relevancy tuning - typically a set of 20-30 queries gathered from query logs or via interviews with subject matter experts
- Operator installation, configuration, system integration, testing and deployment
- System tuning based on the sample queries
- Demonstration / report on the relevancy improvements achieved
DELIVERABLES
- A working Solr Lucene system with new operators included
- New operator source code (if desired)
- Documentation on operator settings
- A relevancy evaluation report
Search Technologies can provide software maintenance and support services, including 24/7 options, both for the newly installed operators and for Solr Lucene as a whole.