Search for Big Data and Big Data for Search

Paul Nelson
Innovation Lead

Search and Big Data have gone together for a long time now. As early as 2009 we researched Hadoop for a large publishing customer. In 2011, we started seeing search over log files as a use case. 

Search Technologies has been officially involved with Big Data since 2008, when we took over responsibility for a massive recommendations engine at a large cable company. That system currently processes more than four billion clicks every day. It was originally a Python-based Map/Reduce system; we migrated it to Hadoop and put it into production in 2010.

And so, I’ve been researching Big Data and (especially) its implications for search for at least seven years now. But I’ve never felt like I had my arms around the topic well enough to write a “Search Chronicles” entry about it. 

Until now. 

Previously, search for Big Data had always felt (to me) like a sideshow. It was a way to provide a quick interactive search over log file data. Since log files had little (if any) full-text content, it was just a curiosity for me.

It took three realizations to change my views on this subject.

Realization #1: It is time for search systems to maintain a (refined) copy of the data

I’m an old-timer. Seriously. I’ve been writing software since 1978.

And like my Depression-era parents, I’ve had a hard time letting go of some early biases, such as an overemphasis on efficient use of hardware and disk space. I can remember working as an engineer at Westinghouse, and marveling over the massive “1 gigabyte” disk drive for our Apollo workstations (it was the size of a washing machine). It is these experiences that lead to biases which must be unlearned over and over again.

Ever since I started my own search engine company in 1989 (my first search engine ran off a floppy disk in 640K of RAM on a Windows PC), the search engine has always been “just an index.” “We don’t make a copy of your data,” I would say confidently to my customers. “We are just an index. The original data stays in your source system.”

But now that has to change. And, if we accept that a copy of the data is inside the search system (note that I am deliberately saying “search system” here, to encompass the infrastructure around the search engine, as well as the engine itself), we open up search to a much wider range of possibilities, including:

  • Fast, bulk re-processing and re-indexing, which means ...
Many search systems today take days, weeks, or even months to re-process and re-index all the data. This is because data needs to be re-acquired from the original content sources. Having a copy of this data in the search system enables fast re-processing and re-indexing, reducing these times to mere minutes or hours.
  • ... that indexing is much more powerful ...
“If we did that, we’d be re-indexing all the time.” 

This statement comes up frequently when discussing search engine architectures. But what if re-indexing and re-processing were no big deal? What if we re-indexed everything all the time, just because we can?

Such systems begin to shift the nature of the algorithms that we can reasonably expect a search system to handle. For example, we can now handle exotic relevancy algorithms and complex search security requirements at index time, rather than having to do all such computations at query time. This further expands the scope and power of the search system as a whole.
  • ... and agile, while ...
The faster we can index, the more our search system will be able to handle changes in content processing, content understanding, and the simple fixing of indexing bugs. It becomes agility personified. 

This will shift how search systems are deployed. Gather data on day one and then defer your content processing decisions until later. That strange field X from content source Y? Don’t worry about it now. Worry about it later. Or never. Perhaps it was not that important to begin with. 

And so, search systems become much more plastic and malleable, and much easier to “play with until it’s working.” And much more fun, too.
  • ... simultaneously putting less strain on legacy systems and ...
If the search system maintains a copy of the data, then that data does not need to be re-acquired from the original source systems whenever you need to re-process and re-index the content.

This means less strain on your repositories, legacy systems, and networks. Once they have served up one copy of the document, that should be enough.
  • ... achieving better separation between content acquisition, content processing, and indexing.
Architects like it when subsystems are separated, with clear, defined boundaries, wherever possible. Standard search engine architectures that combine content acquisition, content processing, and indexing into a single system are inflexible, hard to upgrade, and prone to error. There are too many moving parts.

Separating content acquisition from content processing, by putting a Big Data buffer (a copy of the data) between them, dramatically increases search system stability and reliability. Each subsystem can be individually evolved, tested, and project-managed, independently of the others.
  • And finally...
Having a copy of the data makes "Big Data and Search" possible.

How does one do Big Data processing without data? 

Answer: You can’t, so it is necessary to accept that the search system is now more than “just an index.” Much more. Having a copy of the original data to work on makes everything in this paper possible. It is a prerequisite.
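The separation argued for in the points above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not a description of any real product: the names `ContentBuffer`, `ToyIndex`, `acquire`, `reindex`, and the two processing functions are all hypothetical, and an in-memory dictionary stands in for what would be a Big Data store in practice. The point it shows is that acquisition is the only step that touches the source system, while re-processing and re-indexing read only from the buffered copy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set, Tuple

Document = Dict[str, str]  # hypothetical shape: field name -> field value


@dataclass
class ContentBuffer:
    """Stand-in for the Big Data buffer: a copy of the acquired content."""
    docs: Dict[str, Document] = field(default_factory=dict)

    def put(self, doc_id: str, raw: Document) -> None:
        self.docs[doc_id] = dict(raw)  # keep our own copy


@dataclass
class ToyIndex:
    """Toy inverted index: lowercased term -> ids of documents containing it."""
    postings: Dict[str, Set[str]] = field(default_factory=dict)

    def search(self, term: str) -> List[str]:
        return sorted(self.postings.get(term.lower(), set()))


def acquire(source: Iterable[Tuple[str, Document]], buffer: ContentBuffer) -> int:
    """Content acquisition: the only step that reads from the source system."""
    count = 0
    for doc_id, raw in source:
        buffer.put(doc_id, raw)
        count += 1
    return count


def reindex(buffer: ContentBuffer,
            process: Callable[[Document], Iterable[str]]) -> ToyIndex:
    """Content processing and indexing: reads only from the buffer."""
    index = ToyIndex()
    for doc_id, raw in buffer.docs.items():
        for term in process(raw):
            index.postings.setdefault(term.lower(), set()).add(doc_id)
    return index


def title_only(doc: Document) -> List[str]:
    """Day-one processing: index the title and ignore everything else."""
    return doc.get("title", "").split()


def title_and_field_x(doc: Document) -> List[str]:
    """Later decision: strange field X turned out to matter after all."""
    return title_only(doc) + doc.get("field_x", "").split()
```

With this shape, changing the processing logic means calling `reindex` again with a different function. The source system is never asked for the documents a second time, which is where the agility and the reduced strain on legacy systems come from.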

But still, there are legitimate concerns about copying all of the data into the search system:

  1. It’s a lot of data
  2. It’s not secure

These concerns must be seriously addressed before search systems with Big Data capabilities can become necessary corporate infrastructure.

With regard to #1 (it’s a lot of data!): We don’t actually need to keep a copy of every single bit and byte on every single hard drive. Easily 95% of data on most corporate networks is not useful for search or Big Data (today). Once the text is extracted from a PowerPoint file, the original 10MB PPT can be thrown away. The same is true for images (OCR text extraction), and for video and audio (voice-to-text extraction). Yes, you can argue that some of that original data could be useful for search, but for most applications, those days are far enough into the future that we won’t worry about it for now.
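As a rough sketch of what a “refined” copy might look like, the function below keeps only the extracted text plus a small fingerprint of the original, and never stores the original bytes. Everything here is an assumption made for illustration: `RefinedRecord` and `refine` are hypothetical names, and `extract_text` is a caller-supplied function standing in for whatever PowerPoint, OCR, or voice-to-text extractor a real system would use.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RefinedRecord:
    """What the search system keeps: text and a fingerprint, not the original."""
    doc_id: str
    text: str
    original_size: int    # bytes in the original file, kept for reporting
    original_sha256: str  # lets us detect later whether the source changed


def refine(doc_id: str, original: bytes,
           extract_text: Callable[[bytes], str]) -> RefinedRecord:
    """Extract the text and drop the original bytes.

    `extract_text` is a hypothetical hook; this sketch does not implement
    any real file-format parsing.
    """
    return RefinedRecord(
        doc_id=doc_id,
        text=extract_text(original),
        original_size=len(original),
        original_sha256=hashlib.sha256(original).hexdigest(),
    )
```

Keeping the hash is a design choice rather than a requirement: it gives the acquisition step a cheap way to tell whether a document has changed without holding on to the full original.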

The concern with #2 (it’s not secure!) usually has to do with having copies of sensitive documents “in the clear” inside the search system. Data was arguably safer when it was written directly to the search index, because index files are large binary files that are hard to navigate and read.

Here's a true story, from sometime around 1998. A systems administrator went directly to our Outlook server, unauthorized, and browsed my E-mail looking for something to use against me. So I know all about the mischief that nefarious administrators (Edward Snowden, anyone?) can get into when data is unprotected. 

And so, let's solve this second problem with data encryption, together with internal processes that prohibit casual inspection by administrators and that audit and monitor all such access when it occurs. We need to start development of these controls and systems architectures now, so that they will be ready when corporate planet-earth is ready to make the jump, which is hopefully sooner rather than later.
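The audit half of that proposal can be sketched in a few lines. This is a minimal illustration under assumptions, not a security design: `AuditedStore` and `AuditEvent` are hypothetical names, the store holds opaque bytes that are assumed to have been encrypted elsewhere by a vetted library (the sketch performs no encryption itself), and a real deployment would write the audit trail somewhere administrators cannot edit it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass(frozen=True)
class AuditEvent:
    when: datetime
    user: str
    doc_id: str
    reason: str
    allowed: bool


@dataclass
class AuditedStore:
    """Hypothetical wrapper: every read attempt is recorded, allowed or not."""
    authorized: Set[str]
    docs: Dict[str, bytes] = field(default_factory=dict)
    audit_log: List[AuditEvent] = field(default_factory=list)

    def put(self, doc_id: str, ciphertext: bytes) -> None:
        # Assumption: the caller has already encrypted the content.
        self.docs[doc_id] = ciphertext

    def read(self, user: str, doc_id: str, reason: str) -> bytes:
        # Access requires an authorized user AND a stated reason.
        allowed = user in self.authorized and bool(reason.strip())
        # Log before deciding, so denied attempts are recorded too.
        self.audit_log.append(
            AuditEvent(datetime.now(timezone.utc), user, doc_id, reason, allowed)
        )
        if not allowed:
            raise PermissionError(f"{user} may not read {doc_id}")
        return self.docs[doc_id]
```

The event is logged before access is granted or refused, so an administrator who goes browsing leaves a trace even when the attempt is blocked.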

Read on...

Realization #2: Big Data will make search better, and easier.