Solr Lucene Relevancy Tuning
Search Technologies provides a fixed-price services engagement for improving the relevancy of search results within an existing Solr Lucene implementation.
This engagement will install two powerful relevancy ranking improvements into an existing Solr installation. Also included is a basic system relevancy evaluation and relevancy tuning exercise based on a small set of sample queries.
Additions to the default relevancy formula in Solr Lucene can dramatically improve search results, solving many of the thorniest relevancy problems, including:
- Reducing the impact of peripheral content (sidebars, ads, tangential discussions, etc.)
- Automatically handling word phrases in a flexible manner, reducing the need to use complex query constructions to obtain good search results
This fixed-price service helps ensure that open-source-based search applications provide highly relevant results to users. Improvements in relevancy can transform the contribution that a search application makes to a business process, and often dramatically increase search system usage and user productivity.
Search Technologies has developed two key improvements to the Solr Lucene relevancy ranking algorithms:
Parameterized Document Similarity Function
Default Solr Lucene systems are based on a fixed document similarity function that depends heavily on term frequency / inverse document frequency (tf-idf) statistics. These default implementations put too much weight on document size (boosting small documents) and on rare terms in relevancy calculations. Search Technologies provides parameterized versions of tf-idf that give substantially more control over the relevancy formulas. This new operator has configurable parameters that determine the exact amount of boost for each tf-idf ranking factor. It also provides upper and lower thresholds that reduce the effect of unreliable statistics drawn from very small samples (when a term occurs in only a few documents).
Note: Versions 1.4.0 and 1.4.1 of Solr require a source code patch to implement the Parameterized Document Similarity Function. In releases currently in development (expected to be numbered as version 3.1 or later), the function can be installed via a configuration change and a drop-in library.
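To make the idea of a parameterized similarity more concrete, the following is a minimal Python sketch written for illustration only. It is not the Search Technologies operator and not Lucene code. The function name, the parameter names (`tf_exp`, `idf_weight`, `len_exp`, `min_doc_freq`, `max_doc_freq`) and their defaults are all assumptions; the defaults are chosen only to resemble the general shape of a classic tf-idf formula.

```python
import math


def parameterized_tfidf(freq, doc_freq, num_docs, doc_len,
                        tf_exp=0.5, idf_weight=1.0, len_exp=0.5,
                        min_doc_freq=1, max_doc_freq=None):
    """Score one term in one document (hypothetical formula).

    freq      -- occurrences of the term in the document
    doc_freq  -- number of documents containing the term
    num_docs  -- total number of documents in the index
    doc_len   -- number of tokens in the document (must be > 0)
    """
    # Clamp the document frequency between a lower and an upper threshold
    # so that statistics from a handful of documents cannot dominate.
    upper = num_docs if max_doc_freq is None else max_doc_freq
    clamped = min(max(doc_freq, min_doc_freq), upper)

    tf = freq ** tf_exp                       # exponent controls tf weight
    idf = 1.0 + idf_weight * math.log(num_docs / (clamped + 1))
    length_norm = doc_len ** (-len_exp)       # len_exp=0 disables the norm
    return tf * idf * length_norm


if __name__ == "__main__":
    # Same term statistics, short versus long document.
    for label, kwargs in [("default-like", {}),
                          ("less length bias", {"len_exp": 0.1}),
                          ("rare terms damped", {"min_doc_freq": 20})]:
        short = parameterized_tfidf(2, 3, 100_000, 50, **kwargs)
        long_ = parameterized_tfidf(2, 3, 100_000, 5_000, **kwargs)
        print(f"{label}: short={short:.4f} long={long_:.4f}")
```

Lowering `len_exp` in this sketch narrows the gap between short and long documents, and raising `min_doc_freq` caps the boost that very rare terms receive, which are the two adjustments the description above is concerned with.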
Gradient Proximity Boost
Default Solr Lucene systems have a very limited “hard window” proximity boost. If all terms are “within window” the document will receive a fixed boost multiplier. If any term is “out of window” no boost is applied.
The Search Technologies Gradient Proximity Boost operator instead measures the density and completeness of terms across the document. Documents in which terms are clustered close together will be boosted more than documents in which terms are widely distributed, but in a gradual way. This operator eliminates the need to tweak fixed window sizes.
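The difference between a hard window and a gradual boost can be sketched in a few lines of Python. This is an illustrative assumption about how such a boost might be computed, not the actual Gradient Proximity Boost operator: the function names, the `max_boost` and `scale` parameters, and the specific density formula are all hypothetical.

```python
from collections import Counter


def smallest_span(positions):
    """Length in tokens of the smallest window that contains at least one
    occurrence of every term in `positions` (term -> list of positions)."""
    events = sorted((p, t) for t, ps in positions.items() for p in ps)
    needed = len({t for _, t in events})
    if needed == 0:
        return None
    counts, covered, left, best = Counter(), 0, 0, None
    for pos, term in events:
        counts[term] += 1
        if counts[term] == 1:
            covered += 1
        # Shrink the window from the left while it still covers every term.
        while covered == needed:
            span = pos - events[left][0] + 1
            if best is None or span < best:
                best = span
            left_term = events[left][1]
            counts[left_term] -= 1
            if counts[left_term] == 0:
                covered -= 1
            left += 1
    return best


def hard_window_boost(query_terms, positions, window=10, boost=2.0):
    """Fixed boost only if every query term falls inside the window."""
    unique = set(query_terms)
    matched = {t: positions[t] for t in unique if positions.get(t)}
    if len(unique) < 2 or len(matched) < len(unique):
        return 1.0
    return boost if smallest_span(matched) <= window else 1.0


def gradient_boost(query_terms, positions, max_boost=2.0, scale=10.0):
    """Boost that decays smoothly as terms spread out or go missing."""
    unique = set(query_terms)
    matched = {t: positions[t] for t in unique if positions.get(t)}
    if len(unique) < 2 or len(matched) < 2:
        return 1.0
    completeness = len(matched) / len(unique)       # share of terms found
    slack = smallest_span(matched) - len(matched)   # 0 when terms are adjacent
    density = 1.0 / (1.0 + slack / scale)
    return 1.0 + (max_boost - 1.0) * completeness * density


if __name__ == "__main__":
    query = ["solr", "relevancy", "tuning"]
    tight = {"solr": [0], "relevancy": [1], "tuning": [2]}
    spread = {"solr": [0], "relevancy": [20], "tuning": [40]}
    for name, doc in [("tight", tight), ("spread", spread)]:
        print(name, hard_window_boost(query, doc), gradient_boost(query, doc))
```

In this sketch the hard window gives the widely spread document no boost at all, while the gradient version still gives it a small one, and no window size has to be chosen.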
PREREQUISITES
- A working Solr Lucene system with documents already indexed
EXPECTED ENGAGEMENT TASKS
- Current system evaluation
- Gather basic statistics on the document base (number of documents, average size, number of fields, tokens per document, tokens per field, etc.)
- Gather basic statistics on the query set (number of tokens per query, types of operators used, etc.)
- Gather sample queries for relevancy tuning - typically a set of 20-30 queries gathered from query logs or via interviews with subject matter experts
- Operator installation, configuration, system integration, testing and deployment
- System tuning based on the sample queries
- Demonstration / report on the relevancy improvements achieved
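As an illustration of how tuning against a small sample query set might be measured, the sketch below compares a baseline and a tuned ranking using precision at k. The choice of metric, the function names, and the sample query and document IDs are assumptions made for this example; the engagement's actual evaluation method is not specified here.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k results that were judged relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, judgments, k=10):
    """Average precision@k over every query that has judgments.

    runs      -- query -> ranked list of document IDs
    judgments -- query -> set of document IDs judged relevant
    """
    if not judgments:
        return 0.0
    total = sum(precision_at_k(runs.get(q, []), rel, k)
                for q, rel in judgments.items())
    return total / len(judgments)


def compare(baseline, tuned, judgments, k=10):
    """Per-query (query, before, after) rows for a before/after report."""
    return [(q,
             precision_at_k(baseline.get(q, []), rel, k),
             precision_at_k(tuned.get(q, []), rel, k))
            for q, rel in sorted(judgments.items())]


if __name__ == "__main__":
    # Hypothetical judgments and rankings for two sample queries.
    judgments = {"annual report 2010": {"d1", "d4"},
                 "vacation policy": {"d7"}}
    baseline = {"annual report 2010": ["d9", "d1", "d3"],
                "vacation policy": ["d2", "d5", "d7"]}
    tuned = {"annual report 2010": ["d1", "d4", "d9"],
             "vacation policy": ["d7", "d2", "d5"]}
    for query, before, after in compare(baseline, tuned, judgments, k=2):
        print(f"{query}: {before:.2f} -> {after:.2f}")
    print("mean before:", mean_precision_at_k(baseline, judgments, k=2))
    print("mean after: ", mean_precision_at_k(tuned, judgments, k=2))
```

With only 20 to 30 queries, per-query rows like these are usually more informative than the mean alone, since a single query can move the average noticeably.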
DELIVERABLES
- A working Solr Lucene system with new operators included
- New operator source code (if desired)
- Documentation on operator settings
- A relevancy evaluation report
Search Technologies can provide software maintenance and support services, including 24/7 options, both for the newly installed operators and for Solr Lucene as a whole.