Software Used by the Wikipedia Search Lab
Note: A comprehensive description of how the Wikipedia Search Lab was built, including a more detailed account of the role played by each software component, can be found at Searching Wikipedia with Amazon CloudSearch.
HIGH LEVEL ARCHITECTURE
Notes to the above diagram:
Aspire is Search Technologies' content processing platform: a search-engine-independent Java framework for acquiring, cleaning, normalizing, and enriching content to a consistently high standard.
Aspire is also used to create Premium Data Connectors for Amazon CloudSearch. For example, content from SharePoint, file systems, databases, or Amazon S3 buckets (to name just a few sources) can be streamed to Amazon CloudSearch, with content cleaning, normalization, and metadata enrichment adding value along the way.
For the Wikipedia Search Lab, Aspire automatically downloads Wikipedia dump files directly from Wikipedia and streams them through to Amazon CloudSearch. During this process, Aspire cleans up the data, extracts metadata, creates static teasers for search results, and performs a number of other functions that support search quality and user interface functionality.
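To make the cleanup and teaser steps concrete, here is a minimal sketch in Java. It is not Aspire's actual API: the class TeaserSketch, its method names, and the simplified wiki-markup rules are all assumptions made for illustration, and real Wikipedia markup needs far more handling than this.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Aspire.
final class TeaserSketch {
    // Non-nested templates such as {{citation needed}}.
    private static final Pattern TEMPLATES = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // Wiki links: [[target|label]] or [[target]]; group 1 is the visible text.
    private static final Pattern LINKS =
            Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]");
    // Bold/italic quote markers ('' and ''').
    private static final Pattern QUOTES = Pattern.compile("'{2,}");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Strips a small subset of wiki markup and collapses whitespace.
    static String clean(String wikitext) {
        String text = TEMPLATES.matcher(wikitext).replaceAll("");
        text = LINKS.matcher(text).replaceAll("$1");
        text = QUOTES.matcher(text).replaceAll("");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    // Builds a static teaser: the cleaned text cut at the last space
    // at or before maxChars, followed by an ellipsis.
    static String teaser(String cleaned, int maxChars) {
        if (cleaned.length() <= maxChars) {
            return cleaned;
        }
        int cut = cleaned.lastIndexOf(' ', maxChars);
        if (cut <= 0) {
            cut = maxChars; // no space found: hard cut
        }
        return cleaned.substring(0, cut) + "...";
    }
}
```

Because the teaser is computed once at indexing time and stored with the document, the search page can display it without any per-query text processing.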
Amazon CloudSearch indexes Aspire's XML output, achieving indexing rates of 400+ documents per second.
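As an illustration of what such XML output might look like, the sketch below assembles a small document batch. The batch, add, and field elements follow the early CloudSearch document batch format and should be checked against the API version in use. The class BatchSketch, its methods, and the fixed lang value are assumptions for illustration, not Aspire's actual output code.

```java
import java.util.Map;

// Hypothetical batch builder for illustration only.
final class BatchSketch {
    // Escapes the characters that are unsafe in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Renders one "add" operation with the given fields.
    static String addDocument(String id, long version, Map<String, String> fields) {
        StringBuilder xml = new StringBuilder();
        xml.append("<add id=\"").append(escape(id))
           .append("\" version=\"").append(version)
           .append("\" lang=\"en\">"); // language fixed to English (assumption)
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</add>").toString();
    }

    // Wraps already rendered operations in a single batch element.
    static String batch(Iterable<String> operations) {
        StringBuilder xml = new StringBuilder("<batch>");
        for (String operation : operations) {
            xml.append(operation);
        }
        return xml.append("</batch>").toString();
    }
}
```

Grouping many documents into each batch reduces the number of upload requests per document, which is one plausible contributor to a high indexing rate.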
Amazon CloudSearch provides search results via a RESTful XML interface. This interface is connected to the TwigKit rapid-development user interface toolkit via a new Java API for Amazon CloudSearch, written by Search Technologies.
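A request to a RESTful search interface of this kind is an HTTP GET with the query in the URL. The sketch below only builds such a URL. It is not the Java API written by Search Technologies: the class SearchUrlSketch is hypothetical, the endpoint is a placeholder, and the path and parameter names (2011-02-01, q, size, results-type) are assumed from the early CloudSearch search API and should be verified against the current documentation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical URL builder for illustration only (requires Java 10+).
final class SearchUrlSketch {
    // endpoint: the search endpoint of the domain, without a trailing slash.
    // query:    the user's free-text query.
    // size:     the maximum number of hits to return.
    static String searchUrl(String endpoint, String query, int size) {
        // Form-style encoding: spaces become '+', reserved characters are percent-encoded.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return endpoint + "/2011-02-01/search"
                + "?q=" + encoded
                + "&size=" + size
                + "&results-type=xml"; // request XML rather than JSON (assumed parameter)
    }
}
```

For example, calling searchUrl with the placeholder endpoint http://search-example.invalid, the query "alan turing", and a size of 10 yields a URL ending in /2011-02-01/search?q=alan+turing&size=10&results-type=xml. A client library would then issue the GET request and parse the XML response into result objects for the user interface layer.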
Contact us if you would like to learn more about this project.