Replacing Your GSA with an Open Source Alternative (Elasticsearch or Solr)? Top 5 Things to Consider
With Google as a “household name” in search, the GSA provided powerful search features and good relevancy along with ease of use and maintenance out-of-the-box. But one challenge our GSA clients often encounter is tweaking its relevancy and features for their requirements. This is a common limitation with most commercial search solutions.
Open source search engines such as Elasticsearch and Solr, two of the most popular, provide the flexibility to customize and fine-tune as needed. Migrating from the GSA to open source also gives you the opportunity to add or improve search features that were not feasible on your GSA. So, if you are like the 76% of our GSA survey participants who are considering an open source search engine like Elasticsearch or Solr, there are several things to plan for: how to ingest data, how to query the search engine, how to render the results, and so on.
In this blog, I will cover five key considerations when replacing your GSA with an open source alternative.
1. Language Detection
The GSA has query-time language detection and index-time language detection. Designing your solution with an open source search engine requires deciding whether you have language-specific content. Content available in multiple languages or with mixed languages in the same document will call for different solutions.
To detect the primary language of the document, Elasticsearch and Solr rely on the ingestion process. Language detection plugins, such as Tika’s language detection or Google’s Compact Language Detector, can be easily integrated to detect the language during the ingestion process. These language detection plugins work well for basic language use cases. However, if you have complex use cases for a specific language, you may need customization or integration with other third-party language detection solutions. An example is Rosette Language Identifier from Basis Technology.
Once the primary language of the document is determined, appropriate language analyzers can be specified. For instance, Elasticsearch can use index templates to create mappings that apply language-specific analyzers for different languages. Templates can match index names by pattern, such as a language suffix (e.g. marketing_en for English content or marketing_es for Spanish content). Similarly, in Solr, language-specific field types can be defined (with appropriate Lucene analyzers) and applied at field level.
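As a rough illustration of the suffix convention, the sketch below builds an index template body per language and picks a target index for a document once its language is known. The field names (title, body), the function names, and the fallback to English are assumptions made for this example; the body follows the composable index template format, so older Elasticsearch releases would need a different shape.

```python
# Built-in Elasticsearch language analyzers, keyed by index-name suffix.
ANALYZERS = {"en": "english", "es": "spanish"}


def build_index_template(lang, text_fields=("title", "body")):
    """Return a template body for every index whose name ends in _<lang>.

    The field names are hypothetical; use the fields of your own schema.
    """
    analyzer = ANALYZERS[lang]
    properties = {
        field: {"type": "text", "analyzer": analyzer} for field in text_fields
    }
    return {
        "index_patterns": [f"*_{lang}"],
        "template": {"mappings": {"properties": properties}},
    }


def target_index(base, detected_lang, default_lang="en"):
    """Pick the index for a document from the language found at ingestion."""
    lang = detected_lang if detected_lang in ANALYZERS else default_lang
    return f"{base}_{lang}"
```

With this in place, a document detected as Spanish during ingestion would be sent to target_index("marketing", "es"), and the matching template would apply the Spanish analyzer to its text fields.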
2. Spell Correction
The GSA’s spell correction, which is easily enabled, is based on the indexed content. This feature is also available on Elasticsearch and Solr. In most cases, spell correction built on the indexed content derived from the selected fields (e.g. title, description, subject, etc.) is sufficient. In Solr 6, you can define spell check fields and the content index that holds the spell check fields. Similarly, in Elasticsearch, Term Suggester can be configured with a specific field for spell suggestion. If you have multiple fields, they can be copied to a single field specified in Term Suggester.
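For Elasticsearch, the copy-to-a-single-field approach might look roughly like the sketch below. The field name spell, the suggestion name spelling, and both helper functions are placeholders chosen for this example, not part of any official client.

```python
def spellcheck_mapping(source_fields, target="spell"):
    """Mapping that copies several text fields into one spell-check field."""
    properties = {
        field: {"type": "text", "copy_to": target} for field in source_fields
    }
    # The combined field that the Term Suggester will read from.
    properties[target] = {"type": "text"}
    return {"mappings": {"properties": properties}}


def term_suggest_request(user_query, field="spell", name="spelling"):
    """Search request body asking the Term Suggester for corrections."""
    return {"suggest": {name: {"text": user_query, "term": {"field": field}}}}
```

The first body would be supplied when the index is created and the second sent with a search request. Keeping the combined field free of stemming is usually preferable, so that suggestions are real words rather than stems.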
A common drawback is that certain spell corrections don’t provide expected results due to insufficient content, misspellings in the content, or unrealistic expectations. For more specialized use cases, you can build your own spell correction dictionary or consider alternative solutions, such as:
- The technique defined in this ACL 2009 article: using content from the World Wide Web as a corpus to build an error model and an n-gram language model to provide language-independent auto-correction.
- Microsoft Azure Cognitive Services APIs: the service consists of over 20 APIs, one of which is the Bing Spell Check API. This API leverages machine learning and statistical methods to provide spell correction based on the content’s language. It is readily available for integration with open source search applications.
3. Type-Ahead
Type-ahead, or query completion, is a popular feature that provides suggestions as the user types a few letters in the search box. The GSA’s type-ahead feature is based on:
- Frequent query terms that return non-zero hits
- A blacklist configured to remove certain terms from your auto-completion list
- No configurable whitelist: there is no setting for adding your own terms or phrases to the type-ahead list. As a workaround, you can run a custom script that repeatedly queries the GSA so that query terms with non-zero hits get added to the list.
You can achieve similar behavior in open source Elasticsearch or Solr using two approaches:
- Field based: as with spell correction, the indexed content itself can supply the type-ahead selections. In this approach, a field, such as title or description, is configured to return type-ahead responses.
- Query log based: this approach provides more flexibility for integrating any custom type-ahead terms. It requires a few steps:
  - Creating a logging mechanism that records queries, hits, and session IDs at a minimum. This can get complicated if a language-specific type-ahead feature is required, but it is possible to achieve.
  - Parsing the logs to determine frequent queries that return non-zero hits
  - Building lists of type-ahead terms and storing them in a separate index
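The query-log steps above can be sketched as follows. The record layout (query, hits, session_id), the min_count threshold, and the function name are assumptions for illustration; a real pipeline would read these records from your own log store and write the resulting terms to a dedicated type-ahead index.

```python
from collections import Counter


def build_typeahead_terms(log_records, blacklist=(), whitelist=(), min_count=2):
    """Build a type-ahead list from query logs.

    Keeps queries that returned hits in at least `min_count` distinct
    sessions, drops blacklisted terms, then appends whitelisted terms.
    """
    blocked = {term.strip().lower() for term in blacklist}
    sessions_per_query = {}
    for record in log_records:
        query = record["query"].strip().lower()
        if not query or record["hits"] == 0 or query in blocked:
            continue
        # Count each session once so one user cannot inflate a query.
        sessions_per_query.setdefault(query, set()).add(record["session_id"])

    counts = Counter({q: len(s) for q, s in sessions_per_query.items()})
    terms = [q for q, n in counts.most_common() if n >= min_count]

    # Whitelisted terms are added even if nobody has searched for them yet.
    for term in whitelist:
        term = term.strip().lower()
        if term and term not in terms and term not in blocked:
            terms.append(term)
    return terms
```

Unlike the GSA, this approach makes the whitelist a first-class input, so curated terms no longer depend on a script that issues artificial queries.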
4. Logging and Reporting
The GSA provides basic reports and search logs for top queries, zero hits, etc. As more and more applications are driven by search, it has become critical to incorporate search analytics features. Search analytics is key to understanding how your users use search and where you can make improvements.
There are various third-party tools that can be integrated to enable search analytics in Solr and Elasticsearch. Part of the Elastic Stack, Elasticsearch has a more seamless integration with Kibana and Logstash to provide user-friendly analytics reports and visualization. For Solr, there are similar visualization tools, such as Banana, Silk, and Hue.
5. Click-Through Relevancy
The GSA’s click-through relevancy feature is enabled through Advanced Search Reporting (ASR). This feature automatically analyzes your user behavior and clicks on search results pages in order to fine-tune relevancy scores for specific queries. Open source search engines provide flexibility for tweaking the relevancy based on fields and BM25 relevancy parameters. So, it is possible to build a similar relevancy solution on Elasticsearch and Solr using the following steps:
- Developing a custom mechanism for capturing user click data at session level from logs
- Using Learning to Rank (LTR) in Elasticsearch or Solr to fine-tune search relevancy
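As a minimal sketch of the first step, the function below turns raw click events into graded relevance judgments of the kind a ranking model is typically trained on. The event layout (query, doc_id, clicked), the 0 to 4 grade scale, and the impression threshold are all assumptions for this example. It also ignores position bias, which a production solution would need to correct for.

```python
from collections import defaultdict


def click_judgments(events, min_impressions=5):
    """Convert click events into graded judgments per (query, document).

    Each event records that a document was shown for a query and whether
    it was clicked. The click-through rate is mapped to a grade from 0 to 4.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for event in events:
        key = (event["query"].strip().lower(), event["doc_id"])
        shown[key] += 1
        if event["clicked"]:
            clicked[key] += 1

    judgments = {}
    for key, impressions in shown.items():
        # Skip pairs with too little data to be meaningful.
        if impressions < min_impressions:
            continue
        click_through_rate = clicked[key] / impressions
        judgments[key] = round(click_through_rate * 4)
    return judgments
```

The resulting judgments, combined with feature values for each query and document pair, would form the training set for the ranking model.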
Paying attention to the five things above as you move from the GSA to your new search engine can mitigate risks of disruption and poor performance.
Considering Cloud-Based GSA Replacements?
In addition to open source alternatives, there are various commercial and cloud-based GSA replacement options. Many Search-as-a-Service platforms are available as Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), and Infrastructure-as-a-Service (IaaS). These cloud-based search solutions may be limited to the functionality supported by the providers, but they can be customized to some extent and offer great scalability. For example, Azure Search, Amazon CloudSearch, and Google Cloud Search (currently available to Google G Suite customers) are among the viable SaaS solutions. Both the IaaS and PaaS service models can support cloud-hosted Elasticsearch or Solr applications (similar to on-premises systems).
If you are considering multiple search engines, our e-book covers ten key criteria for evaluating GSA alternatives. To discuss how we can help you plan and implement a seamless GSA migration, request a free consultation.