Big Data and Enterprise Search: Structuring the Unstructured

The Technology Crossover Driving Acquisitions and New Applications 

As you may have noticed, there has been some acquisition activity in our industry recently which was, to an extent, driven by the current enthusiasm for big data. Think Oracle / Endeca and IBM / Vivisimo.

As a provider of enterprise search services, we’ve had numerous conversations around this subject during the past year with customers, prospects and others. Why is enterprise search technology being increasingly associated with the big data world? 

Deriving insights from large data sets is not a new idea. Large B2C companies such as food retailers have been doing this for years. They have been gaining insight from moving tins of beans to a different location in the store and measuring what happens. Online retailers use clickstream data to cross-sell more efficiently and to customize email marketing. This is a key component of Amazon’s highly successful business model. 

So what’s new? Insight from well-established applications such as these is based entirely on structured data created by automated, transactional processes. The big new idea is this:

  • 80% of the world’s data is unstructured, according to analyst consensus
  • How can we also derive actionable insight from that?

Intelligence that drives insight comes in the form of trends, graphs, pie charts and ratios. Here is a great example, courtesy of Google Trends.

However, before graphical UI software can work its magic, the unstructured world needs structuring. Adding structure to unstructured data is the foundation of gaining insight, and it matters a lot how we go about it. This is one reason why big software companies are acquiring search engine companies. The likes of Vivisimo and Endeca have mature and highly capable “indexing pipelines” which perform a number of tasks, most of which involve adding structure to content prior to indexing.

This is an important part of the process. If the steps taken to add structure are inconsistent (e.g., some dates are not normalized, or entity extraction is incomplete), then the accuracy of the data behind the insights becomes questionable. Smart decision makers will want to know the provenance of the data structure on which they are basing business decisions.
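
To make the idea concrete, here is a minimal sketch in Python of what one enrichment step in an indexing pipeline might look like. It is an illustration only, not how Endeca, Vivisimo or any other product works. The date formats, the company list, the field names and the sample records are all invented for this example. The step normalizes dates, does a very simple dictionary lookup for entities, and records which steps succeeded so that gaps in the structure are visible rather than hidden.

```python
import re
from datetime import datetime

# Assumption: a small set of date formats; a real pipeline would handle many more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y")

# Assumption: a tiny entity dictionary; real entity extraction is far richer.
KNOWN_COMPANIES = ("Oracle", "Endeca", "IBM", "Vivisimo")


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def enrich(doc):
    """Add structured fields to a raw document and record which steps succeeded."""
    iso_date = normalize_date(doc.get("date", ""))
    companies = [
        name for name in KNOWN_COMPANIES
        if re.search(rf"\b{re.escape(name)}\b", doc["text"])
    ]
    return {
        **doc,
        "date_iso": iso_date,
        "companies": companies,
        # Provenance: note what was (and was not) achieved for this document.
        "provenance": {
            "date_normalized": iso_date is not None,
            "entities_found": len(companies),
        },
    }


# Invented sample records, written the inconsistent way humans tend to write.
docs = [
    {"text": "Oracle and Endeca came up in the meeting.", "date": "March 5, 2012"},
    {"text": "Notes on the IBM and Vivisimo discussion.", "date": "05/03/2012"},
    {"text": "Some general thoughts on search.", "date": "last Tuesday"},
]

enriched = [enrich(d) for d in docs]
unnormalized = [d for d in enriched if not d["provenance"]["date_normalized"]]
print(f"{len(unnormalized)} of {len(enriched)} documents have dates that could not be normalized")
```

The useful part is the provenance record rather than the parsing itself. A report of how many documents failed each step tells a decision maker how far to trust a chart built on the resulting fields.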

The current explosion of interest in big data is being driven by the inclusion of unstructured data in the analyses. Let’s recognize that structured and unstructured data are fundamentally different:

  • Almost all structured data is produced by computers, and computers create very consistent data that is perfectly formatted. Any inaccuracies will be programmatic and easily seen and solved
  • Unstructured content is largely created by humans. Inconsistent, emotional, careless, opinionated, lazy, driven, over-worked, but always unique, humans

Appreciating this difference in the origins of the data that we seek to analyze is the first step to producing actionable insight and business advantage. 

Adding structure to the unstructured is not just about the software. This is where the crossover from enterprise search to big data matters. Technology is leading the big data trend. But given the eclectic nature of unstructured data, the application of processes, pragmatism and checksums is equally important. A focus on transparency of process creates confidence in data provenance and enables actionable insight from unstructured data. 

As IBM and Oracle will tell you, the technology elements can be bought, at a price. 

The pragmatism, transparency and best practices for implementation can also be bought, at competitive daily rates, from Search Technologies.
