Ingesting Unstructured Content into a Data Lake at Scale
How we acquired over 1 Petabyte of data in a recent client project
Over the past few years, we have worked with several clients to develop data lakes for acquiring enterprise-wide content. Once the content is in the data lake, it can be searched and enriched to generate insights regarding business problems or support a myriad of business needs.
In a previous webinar on the topic of data lakes, Paul Nelson and Alex Olaru demonstrated a way to use Cloudera Morphlines for ingesting structured content into a Cloudera data lake. In a recent project for a pharmaceutical client, we tackled a different problem using Aspire for Big Data: ingesting over one petabyte (PB) of unstructured content into their data lake. Just to put this into perspective, according to a Computer Weekly article, "one petabyte is enough to store the DNA of the entire population of the US – and then clone them, twice."
What is Unstructured Content?
Unstructured content refers to text documents written “by humans for humans,” such as memos, emails, research and regulatory reports, and the content of social media posts. Unstructured content generally lacks a predefined data model to describe its contents. The absence of consistent descriptive metadata poses challenges for applications seeking to generate insights from this content.
Fortunately, software patterns for acquiring and enriching unstructured text with metadata are available through connectors and tagging processes. However, enrichment involves careful planning to extract the text from the content and generate metadata to help software applications (like search engines and big data jobs) make sense of the information.
The Business Problem
Our client has over 1 PB of research data stored in a variety of systems, ranging from Windows and Unix (NFS) file shares to Documentum and SharePoint. These sources hold a wide range of content types, such as PDFs, MS Office documents, email .msg files, images, and miscellaneous text formats. The data lake for this project is hosted in a Cloudera cluster.
Typically, our pharmaceutical clients want the ability to search and run analytics over unstructured content to derive insights from past research, respond to regulatory compliance requests, and fulfill a myriad of other needs. Our immediate business problem was “How can this content be stored in the data lake in a manner that will be useful in addressing these needs?”
Making Sense of Unstructured Content in a Data Lake
In this project, we tackled two important problems related to unstructured content found in many data lake projects:
- How to simplify data lake ingestion, especially for large volumes of unstructured content
- How to ensure the content can be reused and repurposed within the data lake
Our solution embeds our Aspire Content Processing Framework into the data lake as a Cloudera Service. While we are discussing a particular pharmaceutical use case in this post, the solution can be extended to a wide range of domains and use cases.
Simplifying Ingest by Embedding Aspire into Cloudera as a Service
Deployed as a Cloudera Parcel, Aspire resides and runs in the Cloudera Cluster as a Cloudera Service, yielding several benefits. Although we used Cloudera for this particular project, the tools and patterns can be applied to other data lake platforms as well. The data flows for this system are depicted in Figure 1.
Aspire is deployed as a Cloudera service
Aspire 3.2 can be deployed as a Cloudera Service and can communicate with the data lake for content storage and indexing natively.
- Aspire instances can be deployed across many nodes in the cluster to improve performance.
- Aspire leverages the same authentication protocols defined for the cluster, such as Kerberos and LDAP, which simplifies integration and satisfies data lake security requirements.
First-class unstructured connectors can be deployed in a Hadoop cluster
The Aspire Connector framework takes over responsibility for acquiring content.
- Purpose-built connectors can acquire binaries, metadata, and access control lists related to content in enterprise data systems (PDFs, Office documents, lab notebook reports).
- Unlike other ETL models, external systems are not responsible for pushing content to the data lake. External systems do not need to be “data lake” aware.
New connectors can be added to the data lake as needed. For instance, as of early January 2018, connectors for the following sources are compatible with Aspire for Big Data: Documentum, File shares (CIFS and NFS), SharePoint, relational databases via JDBC drivers, Impala, Hive, and Kafka.
Scalability through parallel content acquisition
Aspire can be deployed as a service across MANY nodes in the cluster.
- This means that MANY copies of the connector can run in parallel, delivering high throughput.
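As a rough illustration of why running many connector copies in parallel raises throughput, the Python sketch below spreads a list of source paths across a pool of workers. It is not Aspire code: `fetch_document`, the sample paths, and the worker count are hypothetical stand-ins for a real connector and cluster configuration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_document(path):
    """Hypothetical stand-in for a connector fetching one item from a source."""
    return {"id": path, "size": len(path)}


def crawl_in_parallel(paths, workers=4):
    """Fetch every path with a pool of workers; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_document, paths))


if __name__ == "__main__":
    docs = crawl_in_parallel([f"/share/report_{i}.pdf" for i in range(8)])
    print(len(docs))
```

In a real deployment the workers would be separate service instances on different nodes rather than threads in one process, but the idea is the same: the crawl is split into independent units of work.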
Figure 1. Dataflow Overview
Supporting Efficient Content Re-Use with the Staging Repository
A staging repository is central to this architecture.
- The ingest stage uses connectors to acquire content and publishes it to the staging repository
- The indexing stage picks up the content from the repository and supports indexing or publishing to other sources.
For several years now, we have been using staging repositories to augment content ingest in what we call a "crawl-once and reprocess as needed" approach.
Let’s discuss how a data lake project can make use of this architecture.
Ingest workflow and the staging repository
First, the ingest workflow acquires the content and performs light processing such as text extraction. We then store EVERYTHING we captured, including metadata, access control lists, and the extracted full text of the content, as JSON in the NoSQL staging repository. Binary files (such as PDFs) can be stored in the data lake as well to support future use cases. Connectors can use incremental update procedures to update the content in the data lake as it changes in the enterprise data sources.
A NoSQL store such as MongoDB or HBase is preferred for the Aspire Staging Repository.
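To make the shape of a staged document more concrete, here is a minimal Python sketch of one JSON-style record and a simple incremental update check. Everything in it is an assumption for illustration: the field names, the content hash used for change detection, and the in-memory dictionary standing in for a NoSQL store are not taken from Aspire.

```python
import hashlib
import json


def build_staging_record(doc_id, source, metadata, acls, text):
    """Bundle everything captured for one document into a JSON-serializable dict."""
    payload = {"source": source, "metadata": metadata, "acls": acls, "text": text}
    # Hash the whole payload so a change to text, metadata, or ACLs is detected.
    serialized = json.dumps(payload, sort_keys=True).encode("utf-8")
    return {"id": doc_id, **payload, "content_hash": hashlib.sha256(serialized).hexdigest()}


def upsert(staging, record):
    """Store the record in `staging`; return 'added', 'updated', or 'unchanged'."""
    existing = staging.get(record["id"])
    if existing is not None and existing["content_hash"] == record["content_hash"]:
        return "unchanged"
    staging[record["id"]] = record
    return "added" if existing is None else "updated"


if __name__ == "__main__":
    staging = {}  # hypothetical in-memory stand-in for MongoDB or HBase
    record = build_staging_record(
        "doc-1", "sharepoint", {"title": "Memo"}, ["group:research"], "Extracted text"
    )
    print(upsert(staging, record))
    print(upsert(staging, record))
```

Keeping a hash alongside each record is one simple way to let a re-crawl skip documents that have not changed at the source.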
Figure 2. Ingest Dataflow
A repository deployed in this fashion can save days or even weeks of processing time.
In some cases, it can take weeks to ingest content due to performance limitations or access restrictions in native sources. Storing the content in the repository allows us to CRAWL once and then repurpose the content as needed in the indexing workflow.
A search engine can support all stages of a data lake project, but it plays an especially critical role in the early stages. Initially, our customer indexed data lake content with search engines just to understand what was ingested. In later stages, they created highly curated indexes to support specific use cases.
A first cut analysis may support a review of the content in multiple dimensions:
- By type: In many cases, a single unstructured content source may include different types of content (reports, memos, or lab experiments), as well as some low-value information like logs or working files.
- By use case: Content may also support different use cases.
- By provenance/authorship: Knowing the department, lab, or researcher who generated content is essential.
- By publishing/creation date: Analyzing data lake content by date helps understand trends and developments over time.
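A search engine would typically answer these questions with facet counts. The plain Python sketch below performs the same kind of tally over a few staged records, assuming hypothetical field names such as `type` and `year`; it only illustrates the idea and is not how the project computed its reports.

```python
from collections import Counter


def facet_counts(records, field):
    """Count how many records carry each value of the given metadata field."""
    return Counter(record.get(field, "unknown") for record in records)


if __name__ == "__main__":
    # Hypothetical staged records; real metadata would come from the connectors.
    records = [
        {"type": "report", "department": "oncology", "year": 2015},
        {"type": "memo", "department": "oncology", "year": 2016},
        {"type": "log", "department": "it"},
    ]
    for field in ("type", "department", "year"):
        print(field, dict(facet_counts(records, field)))
```

Records that lack a field are counted under "unknown", which is itself a useful signal of how much content is missing descriptive metadata.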
A Cloudera implementation is perfectly suited to support a large number of Solr indexes to meet diverse use cases.
Figure 3. Indexing Dataflow
Repurposing and enriching content
As depicted in the indexing workflow diagram (Figure 3), a search engine embedded in a big data framework can generate numerous search engine indexes. Each index can be designed to meet the needs of a specific user community within the enterprise, allowing the client to view the content from a variety of perspectives.
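One way to picture several indexes fed from the same staged content is as a set of filters and field projections applied to the same records. The sketch below is a simplified illustration in Python; the `build_index` helper, the sample records, and the two index definitions are hypothetical and do not reflect Aspire or Solr configuration.

```python
def build_index(staged_records, include, fields):
    """Select staged records that match `include` and keep only `fields`."""
    return [
        {name: record[name] for name in fields if name in record}
        for record in staged_records
        if include(record)
    ]


# Hypothetical staged records; field names are illustrative only.
STAGED = [
    {"id": "d1", "type": "report", "department": "oncology", "text": "Phase II results"},
    {"id": "d2", "type": "log", "department": "it", "text": "backup finished"},
    {"id": "d3", "type": "memo", "department": "oncology", "text": "Meeting notes"},
]

# A broad discovery index and a curated index, both built from the same staged content.
discovery_index = build_index(STAGED, lambda r: True, ["id", "type", "department"])
oncology_index = build_index(
    STAGED,
    lambda r: r["department"] == "oncology" and r["type"] != "log",
    ["id", "type", "text"],
)

if __name__ == "__main__":
    print(len(discovery_index), len(oncology_index))
```

Because both indexes read from staged records, adding a third index later requires no new crawl of the original sources.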
The value of the content in the staging repository increases as we know more about it. The content can be classified and tagged over time using ontologies created by our customers or from external domain-specific sources and APIs.
The data lake staging repository supports this use case nicely with the addition of an Aspire enrichment workflow. An enrichment workflow (Figure 4) passes content through REST APIs or other mechanisms to classify documents and perform named entity recognition using domain-specific ontologies or local terminologies.
The workflow can be triggered by one of two scenarios:
- A new document arrives in the repository and needs enrichment.
- An ontology is updated and therefore the previous tagging is “out of date” and must be repeated so that old content can benefit from new terms.
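The two triggers can be reduced to a single check: a document needs enrichment if it has never been tagged or if it was tagged with a different ontology version than the current one. The Python sketch below assumes a hypothetical `ontology_version` field on each staged record and replaces the REST-based classification with a naive single-word term match, purely for illustration.

```python
import re


def needs_enrichment(record, ontology_version):
    """True for untagged documents and for documents tagged with another ontology version."""
    return record.get("ontology_version") != ontology_version


def enrich(record, ontology, ontology_version):
    """Tag the record with ontology terms found in its text (naive single-word match)."""
    words = set(re.findall(r"[a-z0-9]+", record["text"].lower()))
    record["tags"] = sorted(term for term in ontology if term.lower() in words)
    record["ontology_version"] = ontology_version
    return record


if __name__ == "__main__":
    doc = {"id": "d1", "text": "Aspirin reduced fever in the trial."}
    if needs_enrichment(doc, ontology_version=1):
        enrich(doc, {"aspirin", "fever"}, ontology_version=1)
    print(doc["tags"])
```

When the ontology gains new terms and its version is incremented, the same check flags previously tagged documents for another pass, so old content benefits from the new terms.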
Enriching content in the staging repository saves time, because nothing has to be re-crawled from the original sources, and keeps the tagging current as ontologies change. While this project provided a framework for a pharmaceutical client, the pattern can be applied across a wide variety of domains and industries.
Figure 4. Aspire Enrichment Dataflow
Aspire for Big Data provides the missing piece for customers seeking to make use of unstructured content in data lake projects. I hope this blog post demonstrates some of the benefits of our approach.
- Watch my on-demand webinar for a detailed discussion on this topic.
- Contact us to learn more about how you can leverage these technologies and approaches for your organization.