Improving Elasticsearch and Solr Performance with Enhanced Shards' Data Placement
One of the most common ways to improve the performance of software systems is to parallelize the work that needs to be performed. Search engines are no exception: as the data stored in search indices increases, data inside them gets split into multiple shards. These shards are searched in parallel and results are retrieved from them, then merged/sorted before the final search result is produced.
In this blog, I'll provide an overview of a data placement technique which results in substantial improvements in query performance. I'll focus on Elasticsearch and Solr as the main search engines where it can be applied (you can download the tutorial here). Before taking a deep-dive into this technique, it helps to go over a high-level overview how search engines organize the data that they hold.
Both Elasticsearch and Solr are built on top of the Lucene search engine library. Lucene stores the indexed data into a Lucene index which represents the smallest unit of storage data that can stand on its own.
Data stored in an Elasticsearch or a Solr server can be spread across multiple Lucene indices. In both Elasticsearch and Solr terminology, a Lucene index is called a shard. The next higher-level composite data structures (made up of multiple shards) are:
- An Elasticsearch index
- A Solr collection
Data Placement Control
A central premise for getting the best of performance out of the search engines is that the data can be split relatively uniformly over a search engine’s shards. Quite often, though, as data grows inside a search engine, shards end up being unbalanced and/or holding that is not clustered in ways that support the most efficient data retrieval at search time.
Both Elasticsearch and Solr provide a routing scheme to place data into shards based on the murmur3 hashing algorithm of either a routing key provided by the user or based on a unique document id. While routing on document id allows for relatively equal distribution of data over shards, documents that should logically be clustered together can end up being spread across too many shards. On the flip side, a routing based on a user supplied routing key can lead to unbalanced shards.
The distribution of data across shards can sometimes be substantially improved by using more intelligent balancing strategies based on specific knowledge of the business domain and data distributions.
In my tutorial, I provide specific examples (including that of genomics applications) on how to determine a set of routing keys that provide a better data distribution and then how to use these routing keys to map them to a version of keys that the search engine can use for placing data in shards.
An In-Depth Tutorial
When system performance is critical, we recommend paying very close attention to an even distribution of data across the shards of a search engine and employing techniques like the ones described here to supplement the automatic rebalancing that the search engines perform.
Download our full technical tutorial for step-by-step instructions on how to apply this search engine performance improvement technique.