Back to top

Adding Link Analysis to Amazon CloudSearch

By Rafael Alfaro, Technical Consultant, Search Technologies

This article describes an enhancement added to the Wikipedia Search Lab, an Amazon CloudSearch-based search experience for Wikipedia.

The goal of this enhancement was to incorporate a new statistic into the Amazon CloudSearch rank expression, based on a count of inbound links to each article from elsewhere in the Wikipedia data set. The rationale being that the number of inbound links can be seen as an indicator of article popularity and/or authority.  In other words, call it "PageRank light."

Implementing this feature requires a two pass scan of all Wikipedia articles:

  • The first pass to count the links
  • The second pass to index the documents into CloudSearch together with their inbound link count, stored as a field

As with previous Search Lab tasks, content processing is handled by the Aspire content processing framework. This streams source data directly from dump files to Amazon CloudSearch, performing various data cleanup and enhancement tasks on the way.  The original processing pipeline creates a sub job for each Wikipedia article. A new function has been added which extracts and accumulates all internal link references, from all Wikipedia articles.

A bottom-up Wikipedia Parser has also been developed to create hierarchical structure of a Wikipedia article, enabling fast iteration through the elements that make up an article.

This hierarchical structure can be visualized as follows:

Each article is represented as a series of Node Blocks of the following types: Template, Text, Internal Link, External Link, or Header. A node block can also be a sub node, enabling a hierarchical structure to be formed as needed. For example:

Simple Link: [[Distributed Processing]]

    TEXT:  "Simple Link:"
         NODE BLOCK:
              TEXT: "Distributed Processing"

This structure allows us to iterate sequentially through all of the elements to find what we are looking for.  For example, the following code snippet prints to screen all internal links of type=3:

job.nodeBlock.each()   {
    if  (it.type==3)  {
         def link = it.getLink();
         print link;

Before we can start accurately counting links, we need to convert all links to the canonical form. This ensures that links pointing at the same page in Wikipedia, but with different syntax are counted together. For example, [[File:dog.jpg]] is the same as [[Image:dog.jpg]], their canonical form is [[File:Dog.jpg]].

Namespaces, aliases, languages and html and percent encoded character references were considered when converting links to the canonical form.

Besides the canonical form, link cleaning also required the removal of subpage links ([[Dog#Biology]] -> [[Dog]]), colon characters at the beginning of a link ([[:Dog]]->[[Dog]]) and underscore characters were removed and replaced by a single space character. Multiple spaces were replaced by a single space ([[Dog____  food    ]] -> [[Dog food]]).
For the scope of this Lab, link counting was made entirely in memory. A big Hash Map was gathering to count all appearances of links across the whole collection. To avoid extra memory usage, language internal links were excluded from our count (we are only indexing English content) as these will never be used during indexing time. This exclusion reduced our memory usage to half its size.
Once all documents were processed, the link counts were stored in a local file to then be loaded by the data processing components during the second pass (indexing stage).

The in-bound link count file is loaded into memory to be available to every sub job processing a Wikipedia article. Each sub job uses the Wikipedia article title (comes in canonical form) to retrieve its link count from the in-memory HashMap, and this is stored as a field (inbound_count) that will be sent to the CloudSearch indexer.
A new Search Domain was created in the CloudSearch engine based on the Search Domain of our previous Search Lab so we could compare results between the two indexes.

A new relevancy ranking expression was created using the existing factors but with the addition of the inbound_count statistic. This new expression reads as follows:

Some Wikipedia articles are not referenced from others,so the plus one is added to avoid an undefined error in the expression.

NOTE: The boost for inbound count is low when doc_boost = 4 (Categories), because the category pages by definition will have a large number of inbound links, so these are presumed to be of a lower value. Every page that belongs to a given category will be linking to that category. We consider that users will not want to see category pages at the top of the search results.
In previous iterations of this project, we had already established a good boost value for the content_size property, so the idea was to retain the weighting given by this property, but add to it according to the inbound_count, to improve the relevance of pages that maybe are not very large, but are more referenced from the outside and thus more relevant for users.
The boost given to inbound_count is slightly bigger than the one given to content_size (50 : 40) but not big enough to overpower the other property.  In some cases, when an article is big enough to be considered relevant, but doesn't have a lot of references from the outside (it could be a new article, or a very specific article), then the content_size property will be the one boosting the relevance to this article. The next table shows an example of how these two properties complement each other (computed columns are shaded - click the image to expand it).

A comparison between the indexes of our two labs was made to measure the precision of the relevance for the first 10 results for a range of queries. In the following table, notice how on Lab 2, the main articles of swimming climb to the top of the ranking and because of the content size factor, these are not present in the first page of results for Lab 1.



For general queries like "swimming", "Greece", "basketball", etc., a search user will expect to see as the first or second result, the main article about these terms, and not something very specific as was happening on Lab 1 results with "swimming".

As we suspected at the beginning of this enhancement project, the introduction of a link reference factor in our relevance expression significantly improved the order in which results are presented to the user, providing a better overall search experience.
This Wikipedia Demo is available at

Watch out for further enhancements soon!

-- Rafael Alfaro