Big Data Analysis Techniques: Creating New Perspectives
Gartner, IBM and many others have asserted that 80% or more of the world’s information is unstructured – and inherently hard to analyze. What does that mean? And what is required to extract insight from unstructured data?
Very few of the items that we generally call “content” are completely unstructured. For example, even the most casual of Word documents, consigned haphazardly to a lawless file share, has some inherent structure to it. So in reality, nearly all content is semi-structured.
But for the purpose of this blog, let’s stick with the familiar “unstructured” label.
DEFINING DATA AND CONTENT
A useful way of distinguishing between structured data and unstructured content is to consider how it was made:
- Structured data is created by computers (log files, transaction records, RFID events, etc.)
- Unstructured content is generally created by humans
The means of manufacture dictates the properties.
- Computer-generated DATA are 100% consistent, normalized and predictable
- Unstructured CONTENT is infinitely variable in quality and format, because it is produced by humans who can be lazy, fastidious, conscientious, over-worked, unpredictable, ill-informed, highly motivated, or even cynical, but always unique
The value of data for analysis purposes has been recognized and exploited for twenty years by the retail and financial sectors. A range of techniques has been developed, established, and finely honed for analyzing structured data.
Within the current wave of enthusiasm for big data, two things are genuinely new:
- The availability of elastically scalable commodity computer hardware (at affordable prices), in combination with Hadoop, provides a platform with which almost any organization can ask bigger questions*
- The notion that unstructured content can be included in the analysis
In the established world of deriving insight from structured data, a skills shortage is the main restriction on progress (Search Technologies can help, see Hadoop Implementation Services).
The key issue with unstructured content is that algorithms can only analyze structured data. If we want to include unstructured content in analyses (and there are good reasons to do so), we first need to add structure, and how that structure is added matters a lot.
VOLUME, VELOCITY, VARIETY, VERACITY
IBM can be credited with popularizing this four-V description of the properties of big data; the first three Vs are usually traced to analyst Doug Laney's 2001 work at META Group (later acquired by Gartner). When it comes to unstructured content, variety and veracity (meaning uncertainty – but that does not begin with V...) are the key challenges. They are addressed through adding structure and through normalization, but this needs to be done thoughtfully.
Adding structure involves text processing, such as entity extraction, sentiment analysis, and categorization. These are developing sciences. The cheap processing power that elastic computing / Hadoop can provide will generate much innovation in text analysis during the next few years. But it will never be an exact science, unlike the analysis of structured data. A different mindset and approach are needed where unstructured content is involved (see a 2009 blog about this concept written by our Chief Scientist, Why Text Search Programming is Different).
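To make “adding structure” concrete, here is a minimal Python sketch that turns one piece of free text into a flat record that conventional analysis can consume. It is an illustration only, not a description of any particular product: the word lists, the product names and the add_structure function are all hypothetical, and real entity extraction and sentiment analysis use far richer models than a lookup table.

```python
import re

# Hypothetical, hand-made lexicons for illustration only.
POSITIVE = {"great", "love", "reliable"}
NEGATIVE = {"broken", "slow", "refund"}
KNOWN_PRODUCTS = {"acme phone", "acme tablet"}  # invented product names


def add_structure(text):
    """Turn one piece of free text into a flat, analyzable record."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)

    # Naive sentiment: positive word count minus negative word count.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"

    # Naive entity extraction: look for known product names in the text.
    products = sorted(p for p in KNOWN_PRODUCTS if p in lowered)

    return {"products": products, "sentiment": sentiment, "score": score}


record = add_structure("My Acme Phone arrived broken and support was slow.")
print(record)
```

Even this toy version shows why the approach can never be exact: sarcasm, misspellings or a product nickname would all slip straight past it, which is why the way structure is added deserves care.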
CREATING PERSPECTIVES FOR ANALYSIS
With almost any kind of challenge, the ability to look at a problem from multiple perspectives is useful. Through text analysis, new perspectives can be created for the big data world. Here are two simple examples:
Sentiment: A brand owner knows from structured data that sales of a product recently took a hit. Through analyzing social media and other unstructured content, they are able to identify key trends in sentiment and the root causes of the problem, and then formulate action plans to address the issues identified.
Detail: A recruitment company closely monitors the rate at which employment opportunities and resumes are uploaded to its jobs portal. The company understands the overall seasonal trends very well, and has done so for years. Through text analytics, it can now slice and dice this data to examine regional and industry trends. The intelligence gained guides marketing expenditure and helps the company focus on “hot” sectors and cities.
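As a rough sketch of the recruitment example, assume text analytics has already attached a region and an industry to each posting. The records, field names and the slice_counts helper below are invented for illustration. Once those fields exist, the same upload counts can be cut along any combination of dimensions:

```python
from collections import Counter

# Hypothetical records. "month" would come from upload metadata; "region" and
# "industry" are assumed to have been extracted from the posting text.
postings = [
    {"month": "Jan", "region": "Northeast", "industry": "Healthcare"},
    {"month": "Jan", "region": "Northeast", "industry": "Finance"},
    {"month": "Jan", "region": "Midwest", "industry": "Healthcare"},
    {"month": "Feb", "region": "Northeast", "industry": "Healthcare"},
    {"month": "Feb", "region": "Midwest", "industry": "Healthcare"},
]


def slice_counts(records, *dimensions):
    """Count records for each combination of the requested dimensions."""
    return Counter(tuple(r[d] for d in dimensions) for r in records)


overall = slice_counts(postings, "month")              # the view they always had
by_region = slice_counts(postings, "month", "region")  # a new perspective
by_industry = slice_counts(postings, "industry")       # another new perspective

print(overall)
print(by_region)
print(by_industry)
```

The monthly totals were always available from structured data; the regional and industry views only become possible once the text has been given structure.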
SLICE AND DICE
Unstructured content, if appropriately and carefully analyzed, provides multiple new dimensions with which to view analytical challenges. The more perspectives available, the clearer the view, provided that confidence in the data is maintained.
This approach is relatively new to the analytics world, but in the enterprise search space we’ve been doing it for more than a decade. In the search context, this means:
- Carefully analyzing unstructured content and extracting key concepts
- Then presenting new perspectives as navigation options, enabling search users to interactively drill down into and personalize large result sets
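The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine implements it: the documents, the field names and the facet_counts and drill_down helpers are hypothetical, and the “topic” values stand in for concepts extracted from the text at indexing time.

```python
from collections import Counter

# Hypothetical search results with concepts already extracted from the text.
results = [
    {"title": "Q3 pricing review", "topic": "pricing", "author": "lee"},
    {"title": "Pricing FAQ", "topic": "pricing", "author": "kim"},
    {"title": "Support backlog", "topic": "support", "author": "lee"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of a field."""
    return Counter(d[field] for d in docs)


def drill_down(docs, **selected):
    """Keep only the documents that match every selected facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in selected.items())]


topics = facet_counts(results, "topic")          # shown as navigation options
narrowed = drill_down(results, topic="pricing")  # the user clicks "pricing"
authors_left = facet_counts(narrowed, "author")  # options for the next click

print(topics)
print([d["title"] for d in narrowed])
print(authors_left)
```

Each click narrows the result set and recomputes the remaining options, which is the interactive drill-down described above.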
The same approach works with big data analytics, and involves the same combination of technology, skills, and proven processes.
Through our many customer engagements covering both search and analytical applications, Search Technologies has found that this combination of technology, expertise, and proven processes consistently delivers successful projects.
* “Ask bigger questions” is a Cloudera slogan. Search Technologies is an authorized Cloudera implementation partner.