Amazon CloudSearch vs. Solr Cloud
GUEST BLOG: We recently had the opportunity to interview a senior AWS Consultant on a subject that is important to a number of our customers.
How does the Amazon CloudSearch service compare with running Solr Cloud on EC2 servers, or on in-house infrastructure?
As far as Search Technologies is concerned, this is a customer-specific decision, and should be based on a detailed understanding of the customer’s business goals, content landscape, and other circumstances. Our Search Assessment Service is a great way to review and document those factors.
The following notes are from this interview. They have been formatted as a blog article and written in the voice of the AWS Consultant, rather than in our usual independent style. They contain useful information for anyone conducting a Solr vs. CloudSearch comparison.
This discussion is designed to be a quick comparison between Amazon CloudSearch and Apache Solr Cloud, to enable an informed decision to be made when selecting between these two search solutions. Both use the same underlying search engine; however, one is a fully-managed service, while the other requires development, management, and maintenance by the implementer.
To integrate cloud-based search into applications, there are several alternatives. Amazon CloudSearch is a fully-managed service in the AWS Cloud that makes it simple and cost-effective to set up, manage, and scale a search solution for a website or application. Apache Solr is an open-source search platform with a multi-node scaling capability, called Solr Cloud.
With Amazon CloudSearch, you can quickly add search capabilities to your website or application without having to become a search expert, or worry about hardware provisioning, setup, and maintenance. With a few clicks in the AWS Management Console, you can create a search domain and upload the data you want to make searchable. Amazon CloudSearch automatically provisions the required resources and deploys a highly tuned search index.
Apache Solr Cloud provides much the same search functionality as Amazon CloudSearch, but with a more hands-on need for managing resources and building code to do exactly what you want.
Both Amazon CloudSearch and Solr Cloud automatically distribute index updates to the correct location and distribute searches across multiple search instances, and both provide replication and recovery. The difference lies in how you reach that desired state and then manage and maintain it.
When you create an Amazon CloudSearch domain, it manages all of the resources you need to deploy and scale your solution. CloudSearch manages durable storage of content, hardware needed to serve search traffic, hardware needed to build search indices, load balancing traffic, high-availability replication, node recovery, securing your search service, and software upgrades. With CloudSearch you get access to all of these resources through a single account. Your bill is simplified, with a single charge that shows what you are spending on search.
Managing your own Apache Solr cluster with Solr Cloud requires you to provision and configure all of these resources on your own. If you do this on AWS you will need to access services like Amazon S3, Amazon EC2, and Amazon Elastic Load Balancing. Your bill will contain line items for all of these services, so you can still track and control costs, but in a less transparent way. If you do this on your own platform, you will need to provision and configure the compute power and storage. Additionally, without the elastic capabilities of AWS, you will need to handle the provisioning of additional resources as the solution scales. Guessing at peak query loads for search applications has traditionally been very difficult. Often, especially for Web-facing applications, high query loads are a symptom of business success, which makes them an embarrassing time to discover that your infrastructure has run out of horsepower and is providing your users with a degraded service.
Amazon CloudSearch is a RESTful service: you send documents to a document endpoint and search requests to a search endpoint, all over HTTP, with each endpoint reachable through its own DNS entry. As you send documents, CloudSearch deploys those documents to your index in near real-time, automatically provisioning and deploying hardware with sufficient capacity for those indices. When your data needs to be sharded across multiple hosts, CloudSearch automatically builds the shards and deploys them to hosts with no interruption in query processing.
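To make the two-endpoint model concrete, here is a minimal Python sketch of what a client prepares before talking to each endpoint. It assumes the 2013-01-01 API version and the JSON document batch format; the endpoint host names, document IDs, and field names are made-up placeholders, and the helper function names are our own. The sketch only builds the request body and URLs and makes no network calls.

```python
import json
from urllib.parse import urlencode

# Assumption: the 2013-01-01 version of the CloudSearch API.
API_VERSION = "2013-01-01"


def build_batch(docs):
    """Serialize {doc_id: fields} as a JSON batch of "add" operations."""
    ops = [
        {"type": "add", "id": doc_id, "fields": fields}
        for doc_id, fields in docs.items()
    ]
    return json.dumps(ops)


def documents_url(doc_endpoint):
    """URL on the document endpoint that accepts batch uploads (HTTP POST)."""
    return f"https://{doc_endpoint}/{API_VERSION}/documents/batch"


def search_url(search_endpoint, query, size=10):
    """URL on the search endpoint for a simple query (HTTP GET)."""
    params = urlencode({"q": query, "size": size})
    return f"https://{search_endpoint}/{API_VERSION}/search?{params}"


if __name__ == "__main__":
    # Placeholder host names; a real domain has its own pair of endpoints.
    doc_host = "doc-example-domain.us-east-1.cloudsearch.amazonaws.com"
    search_host = "search-example-domain.us-east-1.cloudsearch.amazonaws.com"

    print(documents_url(doc_host))
    print(build_batch({"movie-1": {"title": "Star Wars", "year": 1977}}))
    print(search_url(search_host, "star wars"))
```

The batch body would be sent with a `Content-Type: application/json` header; authentication and error handling are left out of this sketch.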
As you send search traffic to your domain, CloudSearch manages the resources needed to serve that traffic. CloudSearch monitors CPU usage on all nodes in the cluster and, when necessary, adds replicas to serve the traffic volume. If traffic drops off, CloudSearch scales down, removing unnecessary capacity. CloudSearch’s automatic scaling is seamless and does not require you to do additional work: you just send documents and search requests, and CloudSearch manages the resources needed.
You manage your own nodes for Apache Solr Cloud, typically with a second software system such as Apache ZooKeeper. To do that, you first figure out how many nodes you will need, based on how much data you will have and how large the indices will be to hold that data. Then you decide how much traffic you need to serve, and measure or guess at each node’s capacity to serve that traffic. You provision nodes to store the data and serve search traffic, and configure master and replica nodes through XML files, using Apache ZooKeeper. If your data volume grows, you reconfigure your cluster manually, again via XML, rebuild the indices, and get them into service. Every change to the number of nodes in your cluster requires manual intervention and reconfiguration.
So, with Solr Cloud it is your responsibility to monitor the solution to ensure that a) it has sufficient resources, and that b) it isn’t consuming too many resources. Based on that monitoring you then need to manually ensure that the resource provisioning is correct.
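The sizing arithmetic described above can be sketched in a few lines of Python. This is a deliberately simplified model, not a Solr sizing tool: it assumes a fixed index capacity per node and assumes that one full copy of the index (one node per shard) sustains a fixed query rate. The function name and the example figures are hypothetical.

```python
import math


def estimate_nodes(index_gb, gb_per_node, peak_qps, qps_per_copy):
    """Return (shards, replicas, total_nodes) for a simple sizing model.

    Assumptions (illustrative only):
    - each node holds at most gb_per_node of index data;
    - one full copy of the index (one node per shard) serves qps_per_copy.
    """
    if gb_per_node <= 0 or qps_per_copy <= 0:
        raise ValueError("capacities must be positive")
    # Enough shards to hold the data, and enough copies to serve peak traffic.
    shards = max(1, math.ceil(index_gb / gb_per_node))
    replicas = max(1, math.ceil(peak_qps / qps_per_copy))
    return shards, replicas, shards * replicas


if __name__ == "__main__":
    # Hypothetical workload: 120 GB of index, 900 queries/second at peak.
    shards, replicas, total = estimate_nodes(120, 50, 900, 200)
    print(f"shards={shards} replicas={replicas} total_nodes={total}")
```

With these made-up inputs the model asks for 3 shards and 5 copies, or 15 nodes. If the peak estimate is wrong in either direction, the cluster is under- or over-provisioned until someone reconfigures it by hand.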
The real difference between these two approaches is seen with rapidly scaling search requirements, or where the search load is unpredictable, such as on large e-commerce websites where traffic varies with the time of day, the time of year, and the campaigns being run. Here, the automatic scaling of Amazon CloudSearch is a real advantage. For static website search with predictable traffic loads, once configured, there would not be much difference between the two solutions.
Amazon CloudSearch provides a number of features that make your search solution more resilient to failure. Within a single AWS availability zone (AZ), CloudSearch is able to detect and recover from single-node or multi-node failures. When a node fails, CloudSearch can recover the index and updates for that node automatically, because it stores these indices and updates durably.
For even higher availability, Amazon CloudSearch has a Multi-AZ feature that replicates your domain in two different AZs. CloudSearch takes care of provisioning and maintaining hardware, deploying indices, managing update replication, load-balancing traffic, and failover when a single node or zone is offline.
With Solr Cloud, you provision for high availability by adding and managing additional nodes yourself. You handle replication of updates through configuration, managed by ZooKeeper. Note that if the search resource requirements change, you will have to manage the resourcing changes while taking high availability into account.
Furthermore, Solr Cloud ships with an internal ZooKeeper as part of the distribution. This could be a liability when developing a high-availability Solr Cloud solution, as this internal ZooKeeper would not be as redundant as the rest of the solution. For high-availability Solr Cloud implementations it is therefore recommended that you set up an external ZooKeeper ensemble. This is similar to starting multiple instances, but they are configured so that they know about and can talk to each other, allowing them to maintain a quorum in the event of failures.
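The quorum requirement is simple majority arithmetic, and a short Python sketch shows why ensemble size matters. A ZooKeeper ensemble stays available only while a strict majority of its servers are up; the function names here are our own.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

A single ZooKeeper instance tolerates no failures, which is the weakness of relying on the internal one. Three servers tolerate one failure, and a fourth adds nothing over three, which is why ensembles are usually sized with an odd number of servers.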
Both Amazon CloudSearch and Apache Solr provide a rich set of search features including structured and free-text search, faceting, result highlighting, search suggestions, customizable relevance, search across complex expressions, sloppy phrase matching, native support for multiple languages, and term boosting. This is not surprising, as they both have the same underlying search technology, although CloudSearch has some additional features to maintain compatibility with the previous A9 search engine.
Maintenance & Support
As a fully-managed service, CloudSearch provides automatic software maintenance. Your search domain is always up-to-date with the latest bug fixes and feature releases. Amazon has world-renowned support, available at different levels: basic, developer, business, and enterprise.
For Solr Cloud, this is a self-managed process. As an Open Source product, it will be up to you to review the latest software changes and determine whether and how they should be applied. It will also be up to you to determine whether those changes will be backwards compatible or not. Support will also be self-service: if you encounter a problem, it will be up to you to find the solution, hopefully with the assistance of the Open Source community.
With Amazon CloudSearch, there are no upfront commitments or set-up fees. The solution is right-sized automatically, so customers are billed only for their usage across the following dimensions:
- Search instances
- Document batch uploads
- IndexDocuments requests
- Data transfer
To enable customers to estimate their costs, AWS provides a simple cost calculator.
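As a rough illustration of usage-based billing across these four dimensions, the sketch below multiplies usage by a rate for each dimension and sums the results. Every rate and usage figure is an invented placeholder, not an AWS price, and the dimension keys and units are our own naming; use the AWS calculator for real estimates.

```python
def estimate_monthly_cost(usage, rates):
    """Sum usage * rate over every billing dimension in `usage`."""
    missing = set(usage) - set(rates)
    if missing:
        raise KeyError(f"no rate for: {sorted(missing)}")
    return sum(usage[name] * rates[name] for name in usage)


if __name__ == "__main__":
    # Placeholder rates in an arbitrary currency unit (NOT real AWS prices).
    rates = {
        "search_instance_hours": 0.5,
        "document_batch_uploads": 0.001,
        "index_documents_requests": 1.0,
        "data_transfer_gb": 0.1,
    }
    # Placeholder usage for one month.
    usage = {
        "search_instance_hours": 720,
        "document_batch_uploads": 1000,
        "index_documents_requests": 4,
        "data_transfer_gb": 50,
    }
    print(f"estimated monthly cost: {estimate_monthly_cost(usage, rates):.2f}")
```

Because the service right-sizes itself, the search instance hours are an outcome of your traffic and data volume rather than something you fix in advance.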
For Solr Cloud there are no upfront software costs, as the software is Open Source; however, the time to configure and implement the Solr Cloud solution is expected to be significantly greater than for CloudSearch, particularly for high-availability and/or large-scale solutions. In addition, provisioning compute, storage, and load-balancing resources will demand up-front costs unless you use the cloud for these. Irrespective of how the resources are provided, there will also be potential additional costs if the solution is not right-sized: poor response times for users if it is under-provisioned, or additional resource costs if it is over-provisioned.
Comprehensive documentation for both products is available on the web; the base URL for each product is given below.
Amazon CloudSearch and Apache Solr Cloud both provide a comprehensive set of features that can deliver great search for your application.
With CloudSearch you don’t have to manage multiple pieces of software, configure them manually, design a replication and high-availability strategy, provision hardware, scale your search engine, or recover failed nodes. With Solr Cloud, all of these are your responsibility.
Also, with CloudSearch the environment is automatically kept up-to-date, and you have access to world-class AWS support, whereas for Solr Cloud you would be responsible for the updates, maintenance, and support of the solution.