
The Details: Data Acquisition & Processing

Acquiring and Processing Wikipedia Data for Amazon CloudSearch

Paul Nelson
Innovation Lead

Part 2 of 4


Every journey begins with a single step, and my first step was to acquire the data.

Fortunately, Wikipedia publishes dumps of all pages in a convenient (and well-formed) XML format on its dump servers. Once we figured out which site we wanted (enwiki), which version (latest), and which files (articles), we were able to download and decompress them, and voilà! We had acquired Wikipedia.

Going the Extra Mile #1: Streaming Decompression
Decompressing each of the files individually is a pain, and the resulting decompressed file takes up a lot of space. What if we could decompress the data on the fly? How cool would that be? Then we could decompress it and then index it right away, and never have to store the decompressed version on disk.

Our content processing platform, Aspire, is a perfect framework for this, but unfortunately it did not have a BZip2 decompression component. No problem: we created a new Aspire component that uses Apache Commons Compress to wrap the data stream with a BZip2 streaming decompressor, and we were off and running. Now streaming BZip2 decompression is available to everyone using the Aspire platform.
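The idea of decompressing on the fly can be sketched outside of Aspire. The snippet below is only an illustration in Python using the standard bz2 module, not the Aspire component itself (which was built on Apache Commons Compress); the function name iter_decompressed and the chunk size are assumptions made for the example.

```python
import bz2
import io


def iter_decompressed(stream, chunk_size=64 * 1024):
    """Yield decompressed byte chunks from a BZip2-compressed file-like object.

    Nothing is written to disk: the caller can hand each chunk straight to
    the next processing step.
    """
    # bz2.open accepts an already-open binary file object as well as a path.
    with bz2.open(stream, "rb") as decompressed:
        while True:
            chunk = decompressed.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    # Hypothetical in-memory "dump file" standing in for a real download.
    raw = b"<page>hello</page>\n" * 3
    compressed = io.BytesIO(bz2.compress(raw))
    for piece in iter_decompressed(compressed, chunk_size=16):
        print(piece)
```

Because the function is a generator, only one chunk of decompressed data is held in memory at a time.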

Going the Extra Mile #2: Multi-Threaded Processing
In our first version, we just processed one file at a time, single-threaded. But we had plenty of machine power sitting around unused on our 4-core computer. Why not process two files at once? No problem with Aspire. Every pipeline manager is a thread pool, so just create a sub-job for every file, and it just automatically starts processing multiple files simultaneously. Cool – so much faster!
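As a rough analogy for "one sub-job per file, handled by a thread pool", here is a minimal Python sketch. It does not use Aspire; ThreadPoolExecutor stands in for the pipeline manager, and process_all, process_file, and the worker count are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(files, process_file, max_workers=4):
    """Run process_file on every entry of files using a pool of threads.

    Each file plays the role of one sub-job; the pool decides how many
    run at the same time. Results come back in the order of `files`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, files))


if __name__ == "__main__":
    # Stand-in work function; a real one would decompress and index a file.
    print(process_all(["a.xml.bz2", "b.xml.bz2"], str.upper, max_workers=2))
```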

Going the Extra Mile #3: Indexing Directly from Wikipedia
And now, what if we could index the files directly from Wikipedia? In other words, just stream the data directly from the dump servers, through the BZip2 decompressor, and then into the indexer, and never save anything to disk at all. Wouldn’t that be cool?

And not too hard, as it turns out. First we fetch the list of files from the dump site using Aspire's simple “Fetch URL” component. The web page is XML (XHTML, actually), so we can load it directly into Aspire with “LoadXML” and then use XPath and Groovy to pick out the URLs we want. (This turned out to be a bit more complicated, because one of the files, articles27.xml, occurred multiple times, so we had to fetch all of the entries, sort them by date, and then choose the most recent one.)
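The "keep only the most recent copy of each file" step might look something like the following Python sketch. The original did this with XPath and Groovy inside Aspire; the tuple layout (file name, date string, URL), the date format, and the example.org URLs here are all assumptions for illustration, not the real listing format.

```python
from datetime import datetime


def latest_per_file(entries, date_format="%Y-%m-%d %H:%M"):
    """Return {file_name: url}, keeping the newest entry for each file name.

    entries is an iterable of (file_name, date_string, url) tuples.
    """
    newest = {}
    for name, stamp, url in entries:
        when = datetime.strptime(stamp, date_format)
        # Keep this entry if the name is new or this copy is more recent.
        if name not in newest or when > newest[name][0]:
            newest[name] = (when, url)
    return {name: url for name, (_when, url) in newest.items()}


if __name__ == "__main__":
    listing = [
        ("articles27.xml", "2012-01-02 10:00", "http://example.org/old"),
        ("articles27.xml", "2012-01-05 09:30", "http://example.org/new"),
    ]
    print(latest_per_file(listing))
```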

Okay, now we have a list of 27 URLs pointing to dump files. This is where Aspire really shines. We create a separate sub-job for each file and pass them to a sub-pipeline. The sub-pipeline can now fetch the file data from across the web using Fetch URL and then stream the data directly to the BZip2 decompressor. It works great. Even better, because Aspire is naturally multi-threaded, we’re fetching multiple URLs and streaming them over the wire at the same time.


At one level, indexing Wikipedia data was easy. Dump file data is a simple XML format which can be handled automatically with the XML Sub Job Extractor (an Aspire component which uses SAX to split up large XML files into individual jobs). After splitting, we send the individual pieces to the indexer (see next section) where we map the XML elements to Amazon CloudSearch index fields, and we’re done.
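To give a feel for what splitting a large XML file into individual jobs involves, here is a small Python sketch using the standard library's incremental parser. It is not the XML Sub Job Extractor; iter_pages is a hypothetical name, and unlike the Aspire component it only emits <page> elements (so <siteinfo> never shows up as a job).

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page>, without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        yield title, text or ""
        elem.clear()  # release the page's children once it has been handled


if __name__ == "__main__":
    sample = (b"<mediawiki><page><title>A</title>"
              b"<revision><text>alpha</text></revision></page></mediawiki>")
    for title, text in iter_pages(io.BytesIO(sample)):
        print(title, text)
```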

The full Aspire indexing diagram now looks like this:

But of course that is only a very naïve implementation. In order to create something special, we need to dig into the data some more.

Going the Extra Mile #4: Removing non-documents
There are a number of Wikipedia pages we don’t care about. Specifically, the first document is <siteinfo>, not a real article. Other pages either have no content, or are redirect pages, like this:

So, no problem. We can easily detect these situations with a Groovy script and then terminate those jobs, effectively removing these documents from the index.
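A check along these lines could be sketched as follows. This is Python rather than the Groovy script we actually used, and the function name is_real_article and its arguments are assumptions; it relies on the fact that Wikipedia redirect pages begin with #REDIRECT.

```python
import re

# Redirect pages start with "#REDIRECT" (case-insensitive) followed by a link.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)


def is_real_article(element_name, text):
    """Return False for jobs that should be terminated rather than indexed."""
    if element_name != "page":          # e.g. the leading <siteinfo> element
        return False
    if not text or not text.strip():    # pages with no content
        return False
    if REDIRECT_RE.match(text):         # redirect pages
        return False
    return True


if __name__ == "__main__":
    print(is_real_article("page", "#REDIRECT [[Coca-Cola]]"))
    print(is_real_article("page", "Coca-Cola is a soft drink."))
```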

Going the Extra Mile #5: Extracting Categories
One of the first, “wouldn’t it be cool if…” thoughts we had was to use Wikipedia categories as facets in Amazon CloudSearch. Categories are specified inside the document content as internal links:

 [[Category:Living people]]

It turns out that extracting categories is not hard. Since we already have the text from the XML file, we just use a regular expression (in our Groovy stage) to scan through the text and extract all of the category names. We then write these categories to a tag, which we can map to Amazon CloudSearch index fields at index time (see below).
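For illustration, a regular expression for this job might look like the Python sketch below. The exact pattern used in our Groovy stage is not reproduced here; this one is an assumption, and it also drops the optional sort key that can follow a "|" inside a category link.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(text):
    """Return the category names found in a page's wiki text, in order."""
    return [match.group(1) for match in CATEGORY_RE.finditer(text)]


if __name__ == "__main__":
    page = "Some text.\n[[Category:Living people]]\n[[Category:1950 births|Smith]]"
    print(extract_categories(page))
```

Links that merely point at a category page, written with a leading colon as in [[:Category:Living people]], are deliberately not matched.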

Going the Extra Mile #6: Namespace Document Types
Wikipedia has a number of different document types, called “namespaces” in Wikipedia lingo. These include: “Wikipedia” (about Wikipedia), “File”, “Category”, “User”, “User Talk”, “Portal”, “Template”, “Help”, etc. Many of these pages are pretty specific to the internals of the Wikipedia environment – and not actual article text.

So wouldn’t it be nice to have a facet for choosing document type? That way someone who wanted to search only for "Help" pages could easily do so. This is another easy task in Groovy: just extract all of the text in the document title up to “:”, verify that it’s one of the official namespaces, and then write the result into a “type” field.
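The title check can be sketched in a few lines. Again this is Python rather than our Groovy, the namespace list contains only the examples named above (not the full official list), and the "Article" default for ordinary pages is an assumption.

```python
# Only the namespaces mentioned in the text; the real list is longer.
NAMESPACES = ("Wikipedia", "File", "Category", "User", "User talk",
              "Portal", "Template", "Help")
_BY_LOWER = {name.lower(): name for name in NAMESPACES}


def document_type(title, default="Article"):
    """Return the namespace prefix of a title, or `default` if there is none."""
    prefix, separator, _rest = title.partition(":")
    if not separator:
        return default
    # A colon alone is not enough: the prefix must be a known namespace.
    return _BY_LOWER.get(prefix.strip().lower(), default)


if __name__ == "__main__":
    print(document_type("Help:Contents"))
    print(document_type("Star Trek: The Next Generation"))
```

The verification step matters because plenty of ordinary article titles contain a colon.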

Going the Extra Mile #7: Disambiguation Pages
But there is one more special type that’s not an official “namespace”, and that’s Disambiguation pages. For example: Love (disambiguation). Identifying disambiguation pages turns out to be pretty complicated. Some pages have “(disambiguation)” in the title, but others (see Coke) do not.

We also handle this with Groovy scripting and regular expression matching. We check for “(disambiguation)” in the title, and then also look for any of the 33+ different templates which signal a disambiguation page, such as {{disambig}} or {{Call sign disambiguation}}, and make sure that we don’t accidentally confuse them with any of the “disambiguation needed” templates, such as {{disambig needed}}.

This requires both a “white listed” regex and a “black listed” regex, but fortunately it is still just a few lines of code. Again this is typical: 98% of the time is spent researching the data to determine what needs to be done, and then 2% actually doing it.
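A simplified version of the white list / black list idea is sketched below in Python. It is not our actual Groovy code: instead of enumerating the 33+ real templates, the white list here loosely accepts any template whose name contains "disambig", which is exactly why the black list for "needed" templates is required. Both patterns are assumptions for illustration.

```python
import re

TITLE_RE = re.compile(r"\(disambiguation\)\s*$", re.IGNORECASE)

# Loose white list: any template with "disambig" somewhere in its name.
WHITELIST_RE = re.compile(r"\{\{[^{}|]*disambig[^{}]*\}\}", re.IGNORECASE)

# Black list: inline "disambiguation needed" markers on ordinary articles.
BLACKLIST_RE = re.compile(
    r"\{\{\s*disambig(?:uation)?\s+needed[^{}]*\}\}", re.IGNORECASE
)


def is_disambiguation(title, text):
    """Return True if the page looks like a disambiguation page."""
    if TITLE_RE.search(title):
        return True
    # Drop black-listed templates first so they cannot satisfy the white list.
    cleaned = BLACKLIST_RE.sub("", text)
    return bool(WHITELIST_RE.search(cleaned))


if __name__ == "__main__":
    print(is_disambiguation("Coke", "Coke may refer to:\n{{disambig}}"))
    print(is_disambiguation("Paris", "Near [[Springfield]]{{disambig needed}}."))
```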

And be sure to check out one of my many favorite Wikipedia pages: Disambiguation (disambiguation).

Going the Extra Mile #8: Facets for Update Date
I thought it might be fun to have a graph of updates, to see how documents have been updated over time. So we used a few lines of Groovy to create two fields: year, and year_month for the last update date for each Wikipedia page. These are now available as facets.
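Deriving the two fields is straightforward once the timestamp is parsed. The Python sketch below assumes the ISO 8601 timestamp format used in the dump's revision data (for example 2012-03-15T08:21:45Z); the "YYYY-MM" layout chosen for year_month and the function name date_facets are assumptions.

```python
from datetime import datetime


def date_facets(timestamp):
    """Return the year and year_month facet values for a revision timestamp."""
    when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "year": when.strftime("%Y"),
        "year_month": when.strftime("%Y-%m"),
    }


if __name__ == "__main__":
    print(date_facets("2012-03-15T08:21:45Z"))
```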

This turned out to be less useful than I expected. The basic Wikipedia data only contains the most recent update, and since there are a lot of Wikipedia robots running around cleaning things up, this most recent update is almost always sometime in the last year.

Still, it makes for some nice looking charts. Perhaps someday, when we start doing some real “Big Data” processing with Wikipedia, we’ll process the entire history and be able to provide a better sense of when updates occurred.

Going the Extra Mile #9: Producing the teaser
Finally, we wanted to produce a teaser for each document. For all search engines, it’s valuable to have a “Static Teaser” – basically a brief summary (usually the first paragraph) of the document, displayed to give the user an idea of its content. Many search engines also have a “Dynamic Teaser” (snippets from the document which contain highlighted query terms), but dynamic teasers are not always that reliable, so having a static teaser as a fallback is always a good idea.

And so naturally we thought, “Okay, let’s just extract the first 500 characters from the article”.

Unfortunately, lots of Wikipedia pages have handy summaries of the text in an “Info Box” which displays as an inset on the upper right hand side of the page. These info boxes are done with templates and look ugly:

So, this began a long exploration into Wikipedia markup, and how to produce the best presentation. There’s still a lot of work to be done here (especially where templates are concerned), but here are some of the things we did to produce better looking teasers:

  • Remove all templates
  • Replace internal links with the link text
  • Replace external links with link text
  • Remove single-quote markup, such as ''' and '' for specifying bold & italics
  • Remove HTML tags such as <ref> 
  • Remove empty parentheses (because all of the contents have been removed by previous cleanups)
  • Extract captions from embedded [[File:…|Caption]] links, and replace them with (Image:Caption)
  • Replace === equals-sign headers === with “Header:”
  • Remove magic words such as __TOC__ and __NOTOC__
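
The cleanups in the list above can be approximated with a chain of regular expressions. The Python sketch below is a rough illustration, not our production Groovy code: all of the patterns are assumptions, nested links inside image captions are not handled, and <ref> footnotes are removed together with their contents.

```python
import re

TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")  # innermost {{...}} only


def strip_templates(text):
    """Remove templates, repeating so that nested ones disappear too."""
    while True:
        text, count = TEMPLATE_RE.subn("", text)
        if count == 0:
            return text


def make_teaser(wikitext, max_chars=500):
    text = strip_templates(wikitext)
    text = re.sub(r"__[A-Z]+__", "", text)                  # magic words
    text = re.sub(r"<ref[^>]*/>", "", text)                 # <ref ... />
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)                     # other HTML tags
    # [[File:x.jpg|thumb|Caption]] -> (Image:Caption)
    text = re.sub(r"\[\[(?:File|Image):[^\[\]]*\|([^\[\]|]*)\]\]",
                  r"(Image:\1)", text)
    # [[Target|label]] -> label, [[Target]] -> Target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]+)\]\]", r"\1", text)
    # [http://site label] -> label, [http://site] -> nothing
    text = re.sub(r"\[https?://[^\s\]]+\s+([^\]]+)\]", r"\1", text)
    text = re.sub(r"\[https?://[^\s\]]+\]", "", text)
    text = re.sub(r"'{2,}", "", text)                       # bold / italics
    # == Header == -> Header:
    text = re.sub(r"^(={2,})[ \t]*(.*?)[ \t]*\1[ \t]*$", r"\2:", text,
                  flags=re.MULTILINE)
    text = re.sub(r"\(\s*\)", "", text)                     # empty parentheses
    text = re.sub(r"\s+", " ", text).strip()                # tidy whitespace
    return text[:max_chars]


if __name__ == "__main__":
    print(make_teaser("{{Infobox|a=b}}'''Ada''' was an [[England|English]] "
                      "[[mathematician]].<ref>Source</ref>"))
```

The order matters: templates go first so that the infobox never reaches the teaser, and empty parentheses are removed late because they only appear once their contents have been stripped.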
