
A Big Data Architecture for Search

Kamran Khan
Managing Director

“Big Data” has the potential to transform the way we all do business. At Search Technologies, we specialize in addressing unstructured content sources, helping customers to prepare, analyze, and merge insight from human-generated content with structured, machine-generated data.

Yet such is the hype around Big Data these days that if I see it mentioned in the title of a presentation, my first reaction is to roll my eyes and groan. That said, I hope you will forgive the reference in the title of this blog, and read on, because my aim is to describe a practical application of these technologies.

Our core business is helping customers to make their search systems work better. 

The gradual merging of enterprise search technology into the Big Data world is not a one-way street. For sure, the technologies, skills, and expertise built up over two decades by folks in the enterprise search space are critical to the effective use of unstructured content in Big Data applications.

At Search Technologies, we are also looking at things in the other direction. How can we use Big Data technologies to improve search systems?

As a company, we started to take notice of Big Data a few years ago. From the beginning, we wanted to resist the temptation to simply slap a Big Data label across the company, and follow fashion. Instead, we’ve asked questions such as, “How can big data technologies help us to improve the way we serve our customers?” 

The notion of Big Data originates from transactional data sources, such as website log files. Business intelligence and data warehouse companies have been working with this kind of data for decades. Fast-forward to the 2000s, and companies like Google and Amazon realized that there was hidden value in log files which could be mined through analysis. Google created a technique called MapReduce, and Doug Cutting, now with Cloudera, picked the idea up and initiated an open source system called Hadoop. In essence, this enables big problems to be split into many smaller problems, processed on distributed hardware, and then merged back together again. Combine this capability with the elastic availability of computing power from Amazon Web Services (AWS) and others, and we have a platform for tackling big problems that is available to anyone.
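To make the split-process-merge idea concrete, here is a minimal, single-machine sketch of the MapReduce pattern applied to word counting. In Hadoop, the map tasks run in parallel across a cluster and the framework shuffles keys to reducers; here the three phases are simply simulated in sequence.

```python
from collections import defaultdict

def map_phase(documents):
    """Split the big problem into small ones: emit (word, 1) pairs."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Merge the partial results back together again."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big problems split small"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 3
```

The same three-phase shape scales from this toy example to billions of records, because each phase can be distributed independently.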

So what are the “big problems” in the search space? 

Ask users of enterprise search systems, and they’ll probably talk about poor relevancy, or lack of access to important content sources. Ask an IT professional, and they may be looking for improved agility and flexibility, to meet the demands of a fast-changing world. 

If you asked us (and we’ve delivered more than 50,000 person-days of consulting and implementation services to hundreds of customers during the past four years alone), we’d point to the area of content processing: the analysis and preparation of content prior to indexing into the search engine. This important area of search system building is often neglected. Problems such as poor results relevancy are often solved through improved content processing. Almost all of the cool user interface options that users love, such as search navigation options, are based on metadata which has been captured, derived, or generated in the content processing part of the search system.

During the next few years, we believe that the most important improvements in search relevancy and utility will be driven by innovation in content processing. This means advanced statistical analysis techniques, latent semantic indexing, link and citation mapping, improved near-duplicate detection, and a wide range of other techniques.

So how does a tool like Hadoop help us? 

First, it enables us to contemplate much more in-depth, sophisticated analysis techniques. Content processing involves token (word) level analysis. A typical enterprise corpus of ten million documents will contain billions of tokens, and calculations that involve the cross-referencing of such a large number of items are very much in Big Data territory. 
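A toy illustration of what token-level cross-referencing looks like: computing document frequency and inverse document frequency (IDF), a corpus-wide statistic that requires touching every token in every document. The corpus contents here are invented for the example; at billions of tokens, this is exactly the kind of calculation that lands in Big Data territory.

```python
import math
from collections import Counter

# Tiny invented corpus; a real enterprise corpus would hold
# millions of documents and billions of tokens.
corpus = {
    "doc1": "patent search systems index unstructured text",
    "doc2": "search relevancy depends on token statistics",
    "doc3": "token level analysis of unstructured content",
}

# Document frequency: count each token once per document.
doc_freq = Counter()
for text in corpus.values():
    doc_freq.update(set(text.lower().split()))

# IDF: rare tokens score high, common tokens score low.
n_docs = len(corpus)
idf = {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

print(idf["patent"] > idf["search"])  # True: "patent" is rarer
```

Statistics like these feed relevancy ranking, and recomputing them after every pipeline change is one reason content processing benefits from distributed computation.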

Second, we can achieve greatly improved agility in developing and maintaining search systems. Let’s briefly compare a traditional, integrated search architecture with a Hadoop-based, content processing-centric approach.

Getting content processing right involves a lot of iteration. Make changes to the indexing pipeline or content processing system, and it is usually necessary to re-index. But in a typical enterprise search scenario, re-indexing ten million documents will take weeks, because the content must be re-crawled from the various repositories, and the rate at which content can be sucked out of them is usually the limiting factor.

In our new architecture, we create a secure cache of the content in Hadoop, and update this cache as documents are updated in the corpus. This simple change typically reduces re-indexing time from weeks to hours, transforming the tempo of development, saving money, and enabling better algorithms to be developed and tuned within the time available.

Here are a couple of examples for you. 

We have a customer with 90 million patents in their data set. Cross-referencing these for citation analysis and other sophisticated comparison purposes involves a lot of content processing. However, the result is a richer, more relevant, and substantially more productive search experience for users.

We are also working with a recruitment company to statistically match job vacancies against a database of millions of CVs, computing a match percentage that takes into account a wide range of factors such as geographic proximity, salary expectations, skills, and expertise. The result is transformational productivity gains for professional recruiters.
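A hedged sketch of how a weighted multi-factor match percentage might be combined. The factor names, weights, and scoring functions below are invented for illustration; the production system applies statistical techniques over full CV text and many more factors.

```python
def skills_overlap(vacancy, cv):
    """Fraction of required skills the candidate has (0..1)."""
    required = set(vacancy["skills"])
    return len(required & set(cv["skills"])) / len(required) if required else 1.0

def salary_fit(vacancy, cv):
    """1.0 if the candidate's expectation fits the budget, else 0.0."""
    return 1.0 if cv["salary_expectation"] <= vacancy["max_salary"] else 0.0

FACTORS = {"skills": skills_overlap, "salary": salary_fit}
WEIGHTS = {"skills": 3.0, "salary": 1.0}  # illustrative weights

def match_score(vacancy, cv):
    """Weighted average of per-factor scores, as a percentage."""
    total = sum(WEIGHTS.values())
    raw = sum(WEIGHTS[f] * fn(vacancy, cv) for f, fn in FACTORS.items())
    return 100.0 * raw / total

vacancy = {"skills": ["java", "sql"], "max_salary": 90000}
cv = {"skills": ["java", "python"], "salary_expectation": 85000}
print(match_score(vacancy, cv))  # 62.5
```

The appeal of the percentage form is that recruiters can sort and triage candidates directly, rather than reading every CV.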

An Agile Architecture
Not only does this architecture provide a platform for building better search systems, it can also provide content processing services (such as the normalization, cleansing, and enrichment of unstructured content) for business insight applications. This is because we’ve split the content processing tasks away from the core search engine, enabling the processing layer to work with any application. We call it structuring the unstructured.
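The decoupled layer can be pictured as a chain of independent stages, each transforming a document record, with any downstream consumer (a search engine, a BI tool) reading the enriched output. Stage names and logic here are illustrative assumptions, not our actual pipeline.

```python
import re

def cleanse(doc):
    """Strip markup remnants left over from extraction."""
    doc["text"] = re.sub(r"<[^>]+>", "", doc["text"])
    return doc

def normalize(doc):
    """Collapse whitespace and lowercase for consistent matching."""
    doc["text"] = re.sub(r"\s+", " ", doc["text"]).strip().lower()
    return doc

def enrich(doc):
    """Derive metadata that downstream applications can consume."""
    doc["word_count"] = len(doc["text"].split())
    return doc

PIPELINE = [cleanse, normalize, enrich]

def process(doc):
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

out = process({"text": "  <p>Structuring   the Unstructured</p> "})
print(out["word_count"])  # 3
```

Because no stage knows which application consumes the result, the same pipeline output can feed a search index today and an analytics job tomorrow.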

An application-independent content processing layer running on Hadoop will, we believe, become a must-have infrastructure item in the near future. You heard it here first!


Search Technologies' CEO Kamran Khan gave the keynote presentation on this subject at KMWorld 2013 in Washington, DC.