Indexing and Querying with Amazon CloudSearch
Part 3 of 4
Finally we get to Amazon CloudSearch! After all of the processing (which, again, is 95% data analysis and just 5% implementation), we can simply post the documents to Amazon CloudSearch using the Aspire “Post-To-XML” stage. All that’s required is to write an XSLT transform from the Aspire internal metadata format to Amazon CloudSearch XML, which looks like this:
This process pretty much worked first time. Indexing to Amazon CloudSearch is quite easy.
All of the fields specified inside <field name=""> are Amazon CloudSearch index fields which I defined for my search domain.
Some comments on the fields and attributes:
- The @version attribute on the <add> command specifies a “version number”, essentially a transaction ID for the <add> transaction.
  - This ID number must always increase (for the specified document @id), or else the transaction will be ignored.
  - A simple solution (the one which I used) was to use the computer’s epoch time (i.e. the number of seconds since 1/1/1970) as the version number.
  - This will ensure that new indexing runs will update documents from old runs.
- The @id attribute on the <add> command specifies the document ID. This must be made up of lowercase letters and numbers.
  - Wikipedia maintains an RDBMS record ID for all pages, so I used that for the document ID.
  - In other implementations, I’ve used an MD5 of the URL for the document ID.
- For fields which are specified as facets, this version of Amazon CloudSearch is not able to return the facet data for each document in the search results.
  - This is why all of my facet fields are duplicated, for example “categories” and “f_categories”. One will be used for faceting, and the other will be returned with the search results.
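To make the points above concrete, here is a minimal Python sketch (not the Aspire XSLT itself) that builds a single <add> element with an epoch-time version, an MD5-of-URL document ID, and a duplicated facet field. The helper names (make_doc_id, make_version, build_add), the URL, and the category value are illustrative assumptions; only the @id and @version attributes and the “categories”/“f_categories” field names come from the text, and a real search domain may require additional attributes.

```python
import hashlib
import time
import xml.etree.ElementTree as ET


def make_doc_id(url: str) -> str:
    """MD5 hex digest of the URL: only lowercase letters and digits."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def make_version(now=None) -> int:
    """Epoch seconds, so a later indexing run always gets a higher version."""
    return int(time.time() if now is None else now)


def build_add(doc_id: str, version: int, fields: dict) -> ET.Element:
    """Build one <add> element; list values become repeated <field> elements."""
    add = ET.Element("add", {"id": doc_id, "version": str(version)})
    for name, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            ET.SubElement(add, "field", {"name": name}).text = str(v)
    return add


# Illustrative document: the URL and category value are made up for this sketch.
example = build_add(
    make_doc_id("http://en.wikipedia.org/wiki/Germany"),
    make_version(),
    {
        "title": "Germany",
        # Facet fields are duplicated: one copy is returned with results,
        # the other ("f_" prefix) is used for faceting.
        "categories": ["Countries in Europe"],
        "f_categories": ["Countries in Europe"],
    },
)
print(ET.tostring(example, encoding="unicode"))
```

Because the version is just the clock, re-running the indexer later produces a higher version for the same @id, which is what lets new runs overwrite old ones.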
Going the Extra Mile #10: Improving Performance with Batching
The initial indexing runs, from my laptop (!) to Amazon CloudSearch, were not terrible: about 10-20 documents per second. Remember that this is streaming data directly from Wikipedia, through my laptop, and into Amazon CloudSearch.
But how could I make it faster? Well, my naïve implementation sent each document as a separate transaction to Amazon CloudSearch. So step 1 would be to batch up multiple documents into a single transaction.
Fortunately, batching is built into the Aspire framework, so after a simple configuration adjustment, I was batching around 200 documents per batch. Indexing performance increased to 50 documents/second!
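Aspire handles this through configuration, so the following is only a sketch of the general idea, not of Aspire’s implementation: group the per-document <add> strings into chunks of 200 and send each chunk as one transaction. The function names and the <batch> wrapper element are assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(docs: Iterable[str], batch_size: int = 200) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def wrap_batch(add_elements: List[str]) -> str:
    """Join several <add> strings into one payload (wrapper name assumed)."""
    return "<batch>" + "".join(add_elements) + "</batch>"


# 450 placeholder documents become three transactions instead of 450.
docs = [f'<add id="d{i}" version="1"/>' for i in range(450)]
payloads = [wrap_batch(chunk) for chunk in batches(docs)]
print([len(chunk) for chunk in batches(docs)])
```

The gain comes from paying the per-request overhead once per 200 documents rather than once per document.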
Going the Extra Mile #11: Improving Performance with Co-Location
But then I thought: What about running Aspire inside the same availability zone as my Amazon CloudSearch search domain? I guessed it might improve performance since we would eliminate a slow network hop by using Amazon’s uber-fast internal network pipes instead.
So I spun up an Amazon EC2 instance inside the same availability zone and ran indexing again.
450 documents per second (dps)! Nine times faster! This was great. I had arrived.
Of course, just because I can pump all 7.2 million documents to Amazon CloudSearch in a scant 4 hours doesn’t necessarily mean that they are all indexed right away. But Amazon CloudSearch wasn’t too far behind. And you can help out the engine by reducing the number of stored fields and using Literal fields as much as possible.
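As a quick sanity check on those numbers, the arithmetic can be sketched in a few lines of Python. The function name is made up for this example; the document count and throughput figures are the ones quoted above.

```python
def hours_to_send(total_docs: int, docs_per_second: float) -> float:
    """Time to push total_docs at a steady rate, in hours."""
    return total_docs / docs_per_second / 3600


TOTAL_DOCS = 7_200_000
for label, dps in [
    ("unbatched, from laptop (upper figure)", 20),
    ("batched, from laptop", 50),
    ("batched, co-located EC2", 450),
]:
    print(f"{label}: {hours_to_send(TOTAL_DOCS, dps):.1f} hours")
```

At 450 dps this works out to roughly 4.4 hours, in line with the “scant 4 hours” figure, versus about 40 hours at 50 dps.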
Now we have documents indexed into Amazon CloudSearch, so let’s do some queries.
Whenever I research a new engine, I always start with a debug interface. This is a simple interface where you can specify all of your query parameters in an HTML form, submit the search query, and then view the raw results. The debug interface I created for Amazon CloudSearch was on-line as of the time of this writing, and can be accessed here.
Everything worked pretty much out-of-the-box with Amazon CloudSearch. Queries are very fast (even over my 7.2 million documents), facets work great, paging, sorting, etc. are all fine.
The Boolean query language took a little getting used to, since it’s a LISP-style prefix query language:
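The original example is not reproduced here. As a rough illustration of what “prefix” means, the small Python sketch below assembles a nested expression in which the operator comes first inside each pair of parentheses. The helper names, the field/value choices, and the quoting and escaping rules are assumptions for illustration, not a statement of the exact Amazon CloudSearch grammar.

```python
def term(value: str, field: str = "") -> str:
    """Quote a value, optionally prefixed with field: (escaping is assumed)."""
    quoted = "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return f"{field}:{quoted}" if field else quoted


def op(name: str, *clauses: str) -> str:
    """Prefix form: (operator clause clause ...)."""
    return "(" + " ".join((name,) + clauses) + ")"


query = op(
    "and",
    term("germany", "title"),
    op("or", term("history", "categories"), term("geography", "categories")),
)
print(query)  # (and title:'germany' (or categories:'history' categories:'geography'))
```

Compared with the usual infix style (title:germany AND (… OR …)), nesting is always explicit, which makes queries easy to generate from code even if they are a little awkward to type by hand.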
The ability to facet by numeric range is pretty fun.
Going the Extra Mile #12: Relevancy Ranking
The out-of-the-box default relevancy ranking in Amazon CloudSearch for the Wikipedia database was very good, but not great. Documents retrieved were relevant, but the ranking was missing what one would consider to be the “obvious” choices. For example, a search for “Germany” would retrieve the Germany article at position 183. Most users will want these sorts of matches to be retrieved somewhere in the top 10.
I experimented with a number of different techniques using query manipulation, but nothing was working out in the way that I wanted. But then someone suggested trying Amazon CloudSearch Rank Expressions. Maybe I could come up with a rank expression which would improve my search results?
One characteristic that all engines struggle with is document length. See my discussions about TF and IDF in my Graduate Level Course on Relevancy Ranking. Wikipedia is unusual in that larger documents are generally those which are about more significant subjects, and should be ranked higher.
Thinking along these lines, I also realized that documents with smaller titles are also generally more relevant. For example, “Germany” is a nice small title, and is clearly more relevant (for my query) than “List of German corps in World War II”.
So I went back to my indexer (this happens a lot when working on relevancy ranking – no matter which search engine you are working with) and added two new fields: content_size, the number of characters in the cleansed main content of the Wikipedia page, and title_size, the number of characters in the title.
But now a conundrum: What formula should I use?
To help me figure this out, I went back to my Germany query and selected some random documents. I entered the documents and all of their statistics (text_relevance, which is the score from Amazon CloudSearch, plus content_size and title_size) into a spreadsheet:
In this spreadsheet, the computed columns are shaded in grey. All other columns come from Amazon CloudSearch, or were pre-computed by Aspire and stored in the index.
In these sorts of situations, I always take the logarithm of the data first, so I used log10() to convert content_size and title_size to clog and tlog respectively. Why? Well, data which is bounded on one side by zero (it’s impossible to have a document or title of negative size) and unbounded on the other (there is no limit to the size of a document or title) almost always has a log-normal distribution. If you want to use these numbers in a linear formula, taking the log of the value to get a roughly normal distribution is almost always a good idea.
Also, if you think about it, differences in size matter more when you’re smaller. It’s like my 5-year-old nephew, Andrew. A difference of 1 year matters a lot more to a 5-year-old than it does to me, a 48-year-old.
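A tiny Python sketch shows the effect. The sizes are arbitrary round numbers chosen for illustration, not values from the spreadsheet, and the guard against a size of zero is an addition of this sketch.

```python
import math


def log_size(size: int) -> float:
    """log10 of a size; sizes below 1 are clamped so the log is defined."""
    return math.log10(max(size, 1))


# The same absolute difference (1,000 characters) at two different scales.
small_gap = log_size(2_000) - log_size(1_000)
large_gap = log_size(101_000) - log_size(100_000)

print(f"1,000 -> 2,000 characters:     {small_gap:.4f}")
print(f"100,000 -> 101,000 characters: {large_gap:.4f}")
```

After the transform, an extra 1,000 characters counts for far more on a short page than on a long one, which is the behavior the nephew analogy describes.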
Once I had the log values, I started with the following formula…
…and simply tweaked the values for CSFAC and TSFAC until the order of the documents “looked” right. My final formula was this:
Note the minus sign before log10(title_size). Remember that we want to reduce the strength of documents with long titles, and this is what the minus sign does.
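The exact formula and weights are not reproduced here, so the Python sketch below only illustrates the general shape described in the text: text_relevance plus a weighted log of content_size, minus a weighted log of title_size. The CSFAC and TSFAC values and the per-document statistics are placeholders chosen for illustration, not the numbers from the spreadsheet.

```python
import math

# Hypothetical weights; the real values were tuned by hand against a spreadsheet.
CSFAC = 100.0
TSFAC = 100.0


def rank_score(text_relevance, content_size, title_size, csfac=CSFAC, tsfac=TSFAC):
    """Longer content raises the score; a longer title lowers it."""
    clog = math.log10(max(content_size, 1))
    tlog = math.log10(max(title_size, 1))
    return text_relevance + csfac * clog - tsfac * tlog


# Placeholder statistics for two of the titles mentioned in the text.
docs = [
    {"title": "List of German corps in World War II",
     "text_relevance": 300.0, "content_size": 20_000},
    {"title": "Germany", "text_relevance": 250.0, "content_size": 150_000},
]


def score(doc, **weights):
    return rank_score(doc["text_relevance"], doc["content_size"],
                      len(doc["title"]), **weights)


ranked = sorted(docs, key=score, reverse=True)
for doc in ranked:
    print(f"{score(doc):7.1f}  {doc['title']}")
```

With these placeholder inputs the short-titled, longer article moves ahead of the list page even though its raw text_relevance is lower, which is the kind of reordering the tweaking was aiming for.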
Anyway, I entered this formula as my Rank Expression into the Amazon CloudSearch console, and then specified it to be the sort-value for the results using the “rank” parameter, and WOW – it made such a huge difference! Now my |Germany| example was number 2 (instead of 183), and all of the top 10 looked really good. Other queries (such as |George Clooney| and |computer|) were also looking really good.
So happy! Stunned, really, that it worked so well, first time.
Going the Extra Mile #13: De-boosting “Wikipedia:” documents
It was all looking really good, but there were a small number of queries (specifically, the |amazon| query) which were retrieving too many documents from the “Wikipedia:” namespace. “Wikipedia” documents, such as Wikipedia:Policies and guidelines are not really encyclopedia articles per se, but are, instead, discussion documents about Wikipedia itself. They are mostly of interest to the Wikipedians (the people who write articles and maintain Wikipedia) and not so much to the rest of us.
So I figured, let’s just reduce the boost of all documents in the “Wikipedia:” namespace. Normally I don’t like to use such crude techniques (such as boosting or de-boosting an entire document type), but when carefully applied they can be useful.
To make this happen, I first indexed a field (I called it “doc_boost”) which contained an integer number from 0-10 which I mapped to all of the different Wikipedia Namespaces (see Wikipedia:Namespace). Remember that I was already extracting these namespaces as the document type value.
Once I had an integer field for the namespace, I tweaked my rank expression to use it:
Right now the expression only checks to see if the document is in the “Wikipedia” namespace or not, but I could adjust the weights for other types later if needed.
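The modified expression itself is not reproduced here. As a hedged sketch of the idea in Python: map each namespace to an integer doc_boost at index time, then subtract a penalty at ranking time when the value identifies the “Wikipedia” namespace. The mapping values, the “Main” label for ordinary articles, the default, and the penalty size are all hypothetical.

```python
# Hypothetical mapping from namespace (document type) to a 0-10 integer.
NAMESPACE_BOOST = {"Main": 10, "Wikipedia": 0}
DEFAULT_BOOST = 5

WIKIPEDIA_BOOST = NAMESPACE_BOOST["Wikipedia"]
WIKIPEDIA_PENALTY = 200.0  # hypothetical; would be tuned like CSFAC and TSFAC


def doc_boost_for(namespace: str) -> int:
    """Index-time value stored in the doc_boost field."""
    return NAMESPACE_BOOST.get(namespace, DEFAULT_BOOST)


def adjusted_score(base_score: float, doc_boost: int) -> float:
    """Query-time adjustment: only the Wikipedia namespace is penalized."""
    if doc_boost == WIKIPEDIA_BOOST:
        return base_score - WIKIPEDIA_PENALTY
    return base_score


print(adjusted_score(500.0, doc_boost_for("Main")))
print(adjusted_score(500.0, doc_boost_for("Wikipedia")))
```

Storing a small integer per namespace, rather than a yes/no flag, leaves room to weight other namespaces later without re-indexing.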
An Aside: Retrieving Large Results Sets
During my query explorations I thought I would stress-test Amazon CloudSearch a bit and retrieve some very large results sets. This is typically a problem for most search engines, because long results sets take up a lot of memory and data, and schlepping that data from server to server slows the whole thing down. Many search engines put a hard-coded limit on the size of the results set, from 1000 to 4000 documents.
So I thought, let’s retrieve 10,000 documents (IDs only) and see what happens. Then 20,000. Then 200,000. Then 500,000!
Amazingly, Amazon CloudSearch is able to return these enormous results sets in just 1-2 seconds. That’s quite remarkable.
This opens up new possibilities in search, I think, especially for large scale textual analysis projects. For example, large results sets are often required for business intelligence to do complex post-search analysis.