Searching Wikipedia with Amazon CloudSearch

A project to create the ultimate search engine for Wikipedia

When I heard that Amazon was going to create a new web service called CloudSearch, I was naturally thrilled to have the chance to try it out. It’s rare to be given the opportunity to really play with a new engine, and of course I’ve known about the A9 search team for many years now.


But what data should I index? I wanted data with real substance to put Amazon CloudSearch through its paces: at least a million documents, full text, and some obvious facets for navigation. Better still, it would be great if the data were actually useful somehow. So naturally I thought: “Oh, let’s use Wikipedia!”

Like most people, I use Wikipedia several times a day to look up everything from movie actors to statistical regression algorithms. And also like most people, I think that Wikipedia search is really terrible. It’s built on an open source search engine (Sphinx), but the implementation is not very good: the presentation of the results is poor, there are no facets, and accuracy is not that great. It’s a search experience that’s crying out for some tender loving care.

And thus began my great Wikipedia and Amazon CloudSearch journey. 

THE EXECUTIVE SUMMARY
As I wrote this blog, I realized how detailed and potentially boring it was… at least to anyone who spends more time with human beings than with their Eclipse IDE. 

And so, as much as I hope you’ll read through the minutiae of search engine implementation details below, I wrote this summary so you can spend more of your valuable time with family and pets instead.

Data Acquisition 
Just to be super cool, we fetch Wikipedia dump files directly from Wikipedia and stream them through our Aspire content processing framework. At no point do we actually have to download files to disk. The data just magically flows from Wikipedia, through Aspire, and into CloudSearch, and we index the whole thing with the push of a single button.
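In case you’re curious what the no-disk streaming amounts to, here’s a minimal sketch in plain Java (not the actual Aspire pipeline), with the bzip2 decoding handled by Apache Commons Compress:

    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class DumpStreamer {
        public static void main(String[] args) throws Exception {
            // Decompress the dump as it arrives over HTTP; nothing touches disk.
            URL dump = new URL(
                "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2");
            try (InputStream raw = new BufferedInputStream(dump.openStream());
                 InputStream xml = new BZip2CompressorInputStream(raw)) {
                // Pull-parse the decompressed XML stream; downstream stages
                // (splitting, cleanup, posting) consume events from this reader.
                XMLStreamReader reader =
                        XMLInputFactory.newInstance().createXMLStreamReader(xml);
                while (reader.hasNext()) {
                    reader.next(); // hand each event to the processing pipeline here
                }
            }
        }
    }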

Data Processing 
Using Aspire components, we dice up the dump-file XML into individual documents (one per Wikipedia page), extract categories, do some cleanup (remove unnecessary tagging), create the teaser, convert dates, etc. 
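To give a flavor of this processing, here’s a rough sketch of the category extraction and teaser creation in plain Java. The regexes are illustrative only, and far simpler than the real cleanup rules (for one thing, they ignore nested templates):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PageProcessor {
        // Matches [[Category:Some category]] links in wikitext.
        private static final Pattern CATEGORY =
                Pattern.compile("\\[\\[Category:([^\\]|]+)");

        static List<String> extractCategories(String wikitext) {
            List<String> categories = new ArrayList<>();
            Matcher m = CATEGORY.matcher(wikitext);
            while (m.find()) {
                categories.add(m.group(1).trim());
            }
            return categories;
        }

        static String makeTeaser(String wikitext) {
            // Crude cleanup: drop un-nested templates and link markup,
            // collapse whitespace, then keep the first 500 characters.
            String clean = wikitext
                    .replaceAll("\\{\\{[^{}]*\\}\\}", " ")
                    .replaceAll("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]", "$1")
                    .replaceAll("\\s+", " ")
                    .trim();
            return clean.length() <= 500 ? clean : clean.substring(0, 500);
        }
    }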

Turns out, Wikipedia data is depressingly complicated. Wikipedia pages have templates inside of templates inside of external links inside of internal links, and so on. It’s a mess, but frankly that’s typical of any large, richly textured data set written by human beings. There’s a lot more work we could do in this area. 

Indexing 
Indexing into Amazon CloudSearch was super easy. We already had an Aspire “Post XML” stage which did exactly what was needed. We just wrote an XSLT transform to map Aspire metadata into Amazon CloudSearch index fields, and it pretty much worked the first time.
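For reference, the output of that transform is a batch in CloudSearch’s XML Search Data Format. A minimal batch for one page looks roughly like this (the field names are just our schema, not anything CloudSearch mandates):

    <batch>
      <add id="albert_einstein" version="1334102400" lang="en">
        <field name="title">Albert Einstein</field>
        <field name="category">German physicists</field>
        <field name="category">Nobel laureates in Physics</field>
        <field name="teaser">Albert Einstein was a theoretical physicist who...</field>
        <field name="text">...full cleansed article text...</field>
      </add>
    </batch>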

There were, however, a few Amazon CloudSearch-specific details that required some special TLC:

  1. The “version” attribute – essentially a transaction ID to ensure correct transaction ordering – must be unique and must increase with every update. We used the crawl start time for the version attribute (see the sketch after this list).
  2. Dates needed to be converted to integers (number of minutes since 1/1/1970).
  3. Identifiers are not allowed to include punctuation or upper-case letters.
  4. Since this initial version of Amazon CloudSearch has no dynamic teasers, we created a static teaser from the first 500 (cleansed) characters of text.
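Here’s a small sketch of the first three conventions in plain Java; the helper names are mine, not part of any AWS or Aspire API:

    public class CloudSearchConventions {
        // 1. One version for the whole crawl: the crawl start time in seconds,
        //    so each re-crawl posts a strictly larger version than the last.
        static long crawlVersion() {
            return System.currentTimeMillis() / 1000L;
        }

        // 2. Dates as integers: the number of minutes since 1/1/1970.
        static long dateToMinutes(java.util.Date date) {
            return date.getTime() / (60L * 1000L);
        }

        // 3. Document IDs: lower-case, no punctuation. Here anything outside
        //    [a-z0-9] becomes an underscore.
        static String toDocId(String title) {
            return title.toLowerCase().replaceAll("[^a-z0-9]+", "_");
        }
    }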

A couple of other quick comments: batching helps indexing performance a lot, and co-locating your indexer in the same availability zone as your Amazon CloudSearch search domain also helps. You can send documents to Amazon CloudSearch really fast (nearly 500 docs/second in my test run).
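To show how little a batch post involves, here’s a bare-bones sketch using nothing but the JDK. The endpoint host below is a placeholder; substitute your own domain’s document service endpoint:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class BatchPoster {
        static int postBatch(String sdfXml) throws Exception {
            // Placeholder endpoint; every domain gets its own document service host.
            URL endpoint = new URL(
                    "http://doc-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com"
                    + "/2011-02-01/documents/batch");
            HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/xml");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(sdfXml.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode(); // 200 means the batch was accepted
        }
    }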

Query and Relevancy Ranking 
The query API for Amazon CloudSearch is a simple RESTful interface with useful, predictable parameters. Searches worked great and facets were a breeze, and all the other features (paging, returned fields, filters, etc.) worked as expected.
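For example, a faceted query against the search endpoint looks something like this (placeholder domain host, our own field names, and line breaks added for readability):

    http://search-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search
        ?q=einstein
        &return-fields=title,teaser
        &facet=category
        &size=10
        &start=0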

Initial relevancy for our Wikipedia database was good, but not great. After some analysis, we realized that we needed a better formula for handling large documents. Fortunately, Amazon CloudSearch’s custom rank expressions fit the bill perfectly, and now relevancy is excellent.
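To give a flavor of what a rank expression looks like (this is a sketch, not our exact production formula; text_length is a hypothetical integer field populated at index time, and you should check the CloudSearch documentation for the exact set of supported functions), one way to dampen the advantage of very long documents is to divide by a log of the length:

    text_relevance / log10(max(text_length, 10))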

One more thing that’s quite amazing is that Amazon CloudSearch can retrieve enormous result sets of hundreds of thousands of rows. This is very unusual, unlike any other search engine I’ve dealt with before, and I think it means Amazon CloudSearch will prove to be an especially good engine for business analytics.

User Interface 
To build the user interface, we went the extra mile: first we created a Java API for Amazon CloudSearch (let us know if you want a copy!) and then we used it to build a Twigkit Platform for Amazon CloudSearch.

Twigkit is great because they’ve handled all of those annoying (to me) user interface details like facet navigation, results presentation, and paging, and they have some cool features like charts and graphs which are super easy to implement.

And so, once the pieces were complete and plugged together (and after a lot of tweaking), voilà!

To summarize the summary:

  • Aspire for content preparation
  • Amazon CloudSearch as the search engine
  • Twigkit (plus a new CloudSearch Java API) for the user interface


That was fun! I should do summaries for all my blogs. And now for the gloriously gory details…

THE DETAILS
