
Search: The Interactive Explorer of Big Data


  • Search will play a pivotal role in the next generation of interactive, exploratory data analysis applications.
  • Recent moves by Big Data market leaders reinforce this direction.
  • Everybody knows how to search, so using a search layer to explore data is a natural evolution in human behaviour.
  • Organizations can act now, position themselves to adopt best practices, and be leaders in this field.



Data analysis - looking for trends and insights - has traditionally been report-based, with SQL queries and other carefully crafted syntax underpinning standardized reports. In the future, the interactive exploration of data sets will become increasingly important.

Executives love their fixed-format reports, which deliver the precise information they need to monitor business operations. But “knowledge workers”, whose insights provide the foundations for innovation and future corporate prosperity, find their inspiration through interacting with datasets.

Search will play the pivotal role in creating interactive, exploratory applications. In this article, we will consider some of the fundamental reasons for the shift to search.



Yonik Seeley, the father of Solr, has joined Cloudera. 

For those not familiar with Cloudera, they are the leading pure-play “Big Data” software company. Based on Apache Hadoop, their commercial software offering provides additional features, plus commercial-grade support for production systems. 

A year or so ago, “Cloudera Search”, which is based on Solr, was added to their stack. Cloudera has now followed through with this line of enquiry, and hired a leading Solr guru.

From our perspective (search engine implementation and consulting services), we’ve seen this coming. Our CEO has spoken at numerous conferences in recent years about the potential role of search in the Big Data world.

Why is search so important to Big Data? In short, search provides the best method for interactively exploring and analysing large data sets.



Search is so familiar to us that we sometimes forget why we like it. In the context of analytics, here are three very good reasons to like search, as a default access method. Search is:

  • FAST: Even where very large data sets are involved, search can deliver sub-second response times. This encourages conversation with the data.
  • FLEXIBLE: A search index provides huge flexibility – it is the ultimate schema-free data structure, imposing no hard constraints on what combination of words or data facets can be searched for, compared, or analysed.
  • FAMILIAR: Everyone understands how search works, and most of us use it on a daily basis.

Deploying search as an interactive analysis tool is simply a question of applying best practices, architectural thought, and an appropriate UI to display query results. 
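To make the "flexible" point concrete, here is a minimal sketch of an inverted index, the kind of data structure that sits at the heart of search engines such as Solr. It is an illustration only, not how any particular engine is implemented: the sample records, the field names, and the `build_index` and `search` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
DOCS = {
    1: {"text": "solr search on hadoop", "vendor": "cloudera", "year": "2014"},
    2: {"text": "elasticsearch log search", "vendor": "elastic", "year": "2014"},
    3: {"text": "hadoop batch reports", "vendor": "cloudera", "year": "2013"},
}

def build_index(docs):
    """Map every (field, token) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            for token in value.lower().split():
                index[(field, token)].add(doc_id)
    return index

def search(index, **criteria):
    """Return ids of documents matching every field=token criterion (logical AND)."""
    postings = [index.get((field, token.lower()), set())
                for field, token in criteria.items()]
    return set.intersection(*postings) if postings else set()

INDEX = build_index(DOCS)
print(sorted(search(INDEX, text="search")))                     # [1, 2]
print(sorted(search(INDEX, text="hadoop", vendor="cloudera")))  # [1, 3]
```

No combination of fields has to be planned in advance: any word or facet value that was indexed can be combined with any other at query time, and each lookup is a dictionary access followed by a set intersection.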



The key to creating search-based analytical applications is building a great index. Some thought needs to go into the pre-processing of data prior to indexing, and consideration should be given to issues such as data quality, cleanliness, and the normalization of terms. For example, should references to “International Business Machines” be mapped together with “IBM” for analysis purposes?
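As a rough sketch of what term normalization can look like in a pre-processing step, the snippet below rewrites known aliases to a single canonical form before the text is indexed. The alias table and the `normalize_terms` helper are hypothetical; a production pipeline would typically rely on the synonym or entity-mapping features of the search engine itself.

```python
import re

# Hypothetical alias table: each variant (lowercase) maps to one canonical term.
ALIASES = {
    "international business machines": "IBM",
    "ibm": "IBM",
}

def normalize_terms(text, aliases=ALIASES):
    """Replace each known alias with its canonical form, ignoring case."""
    for alias, canonical in aliases.items():
        # The lookarounds stop an alias matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(alias) + r"(?!\w)",
                             re.IGNORECASE)
        text = pattern.sub(canonical, text)
    return text

print(normalize_terms("International Business Machines and ibm both filed patents"))
# IBM and IBM both filed patents
```

With both variants mapped to the same term at index time, a count or trend for “IBM” covers every reference, whichever form the source document used.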

The good news is that the skills and best practices needed to build a great search index already exist. We've been doing this in the enterprise search industry for years.



Here’s an example of a simple search interface in action over a very large data set.

On the day that the Yonik Seeley / Cloudera story hit the wires, the author of this article found himself using his favourite public-facing Big Data analysis tool, Google Trends.

As a search-centric company, we love anything to do with search log files. Many of our customers have benefitted from harvesting the intelligence that search log files contain. In an effort to avoid getting distracted into that subject, let’s limit discussion of search log analysis within this article to one anecdote.

Back in the noughties, our contact at a newspaper customer (he was in charge of the Website, and took search seriously) used to hand-deliver a daily listing of the top 20 searches on the Website from the previous 12 hours to the editorial board, which sat each evening in a smoke-filled room. He claimed many times (over beers) that his search logs “changed the front page” of the next morning’s printed edition. Search logs can affect content strategy too. Nobody had guessed that “crosswords” would be a popular online distraction for readers, but search logs revealed this, and a daily online crossword puzzle was duly implemented.
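A "top 20 searches" listing of this kind takes very little code to produce. The sketch below is an assumption about how such a report might be built today, not a description of the newspaper's actual system; the `top_searches` helper and the sample log entries are hypothetical.

```python
from collections import Counter

def top_searches(queries, n=20):
    """Return the n most frequent queries, ignoring case and stray whitespace."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(n)

# Hypothetical extract from a search log, one query string per entry.
log = ["crosswords", "Crosswords ", "weather", "election results",
       "crosswords", "weather", ""]

print(top_searches(log, n=2))  # [('crosswords', 3), ('weather', 2)]
```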



Inquisitive minds ask questions, and online, the formation of a query string for submission to a search engine remains the most popular approach to finding information. A query is a straightforward, free-form expression of a need.

Google Trends is the ultimate manifestation of query log analysis, based on billions of daily searches. It provides a useful sounding board for tracking and comparing the popularity of search terms. It is simple in terms of what it does, yet powerful in terms of the insight it can deliver.

For example, in the open source search arena, interest in Elasticsearch has overtaken interest in Solr during the past year. A gut feeling may have told us this, but nothing beats empirical evidence to reinforce, contradict, or challenge a theory.


Below (courtesy Kibana, the UI provided in the Elasticsearch “ELK” stack) is an example of how search results, delivered in response to a query over a very large data set, can be displayed in a way that encourages thought and provokes insight.



This is the future of search. It will continue to support important business processes, from e-Commerce to research, and from customer support to compliance. But at the same time, search will become the default access method for exploring big data.



  • Everyone knows search. It is the universal access method for checking facts, retrieving documents, or discovering what information exists on a subject. 
  • In the Big Data world, search is also set to become the universal tool for interactive analysis of very large datasets.