A Data Lake Architecture with Hadoop and Open Source Search Engines
Using Enterprise Data Lakes for Modern Analytics and Business Intelligence
"Big data" and "data lake" only have meaning to an organization's vision when they solve business problems by enabling data democratization, re-use, exploration, and analytics. At Search Technologies, we're using big data architectures to improve search and analytics, and we're helping organizations do amazing things as a result.
What Is a Data Lake?
A data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. An “enterprise data lake” (EDL) is simply a data lake for enterprise-wide information storage and sharing.
A data lake architecture incorporating enterprise search and analytics techniques can help companies unlock actionable insights from the vast structured and unstructured data stored in their lakes.
What Are the Benefits of a Data Lake?
The main benefit of a data lake is the centralization of disparate content sources. Once gathered together (from their “information silos”), these sources can be combined and processed using big data, search and analytics techniques which would have otherwise been impossible. The disparate content sources will often contain proprietary and sensitive information which will require implementation of the appropriate security measures in the data lake.
The security measures in the data lake may be assigned in a way that grants access to certain information to users of the data lake who do not have access to the original content source. These users are entitled to the information, yet cannot access it at its source for one of several reasons.
Some users may not need to work with the data in the original content source but consume the data resulting from processes built into those sources. There may be a licensing limit to the original content source that prevents some users from getting their own credentials. In some cases, the original content source has been locked down, is obsolete or will be decommissioned soon; yet its content is still valuable to users of the data lake.
Once the content is in the data lake, it can be normalized and enriched. This can include metadata extraction, format conversion, augmentation, entity extraction, cross-linking, aggregation, de-normalization, or indexing. Read more about data preparation best practices. Data is prepared “as needed,” which reduces preparation costs compared with up-front processing (such as would be required by data warehouses). A big data compute fabric makes it possible to scale this processing to include the largest possible enterprise-wide data sets.
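To make the “as needed” idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: the in-memory `RAW_STORE`, the `B-1234` style batch-id pattern, and the function names are all hypothetical. Raw content stays in its native format, and format conversion, metadata extraction, and entity extraction run only when a source is first requested.

```python
import csv
import io
import json
import re
from functools import lru_cache

# Hypothetical raw store: content kept in its native format, keyed by path.
RAW_STORE = {
    "support/ticket-1.json": '{"id": 1, "body": "Yield dropped on batch B-1042 at site Dublin"}',
    "mfg/yields.csv": "batch,yield_pct\nB-1042,87.5\nB-1043,91.0\n",
}

# Assumed batch-id pattern, for illustration only.
BATCH_ID = re.compile(r"\bB-\d+\b")


def parse(path: str, raw: str) -> list[dict]:
    """Format conversion: turn native JSON or CSV into a list of dict records."""
    if path.endswith(".json"):
        return [json.loads(raw)]
    if path.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(raw)))
    return [{"body": raw}]  # unknown format: keep as opaque text


def enrich(path: str, record: dict) -> dict:
    """Metadata and entity extraction: add source info and any batch ids found."""
    text = " ".join(str(value) for value in record.values())
    return {
        **record,
        "_source": path,
        "_format": path.rsplit(".", 1)[-1],
        "_batches": sorted(set(BATCH_ID.findall(text))),
    }


@lru_cache(maxsize=None)
def prepared(path: str) -> tuple:
    """Prepare a source the first time it is requested; later calls reuse the result."""
    raw = RAW_STORE[path]
    return tuple(enrich(path, record) for record in parse(path, raw))


# Nothing is parsed or enriched until a user actually asks for a source.
for record in prepared("mfg/yields.csv"):
    print(record)
```

In a real data lake the same pattern would run on a distributed compute layer rather than in a single process, but the cost argument is the same: sources nobody queries are never prepared.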
Users, from different departments, potentially scattered around the globe, can have flexible access to the data lake and its content from anywhere. This increases re-use of the content and helps the organization to more easily collect the data required to drive business decisions.
Information is power, and a data lake puts enterprise-wide information into the hands of many more employees to make the organization as a whole smarter, more agile, and more innovative.
Searching the Data Lake
Data lakes will have tens of thousands of tables/files and billions of records. Even worse, much of this data is unstructured and varies widely in format.
In this environment, search is a necessary tool:
- To find tables that you need - based on table schema and table content
- To extract sub-sets of records for further processing
- To work with unstructured (or unknown-structured) data sets
- And most importantly, to handle analytics at scale
Only search engines can perform real-time analytics at billion-record scale at a reasonable cost.
Search engines are the ideal tool for managing the enterprise data lake because:
- Search engines are easy to use – Everyone knows how to use a search engine.
- Search engines are schema-free – Schemas do not need to be pre-defined. Search engines can handle records with varying schemas in the same index.
- Search engines naturally scale to billions of records.
- Search can sift through wholly unstructured content.
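The “schema-free” point can be illustrated with a toy inverted index. This is a minimal sketch, not how Cloudera Search or Elasticsearch are implemented: the `TinyIndex` class and the sample records are hypothetical, and real engines add analyzers, relevance ranking, and distributed shards. It shows records with entirely different fields living in the same index, searchable by content or by field name.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9_]+")


class TinyIndex:
    """Toy inverted index: maps tokens (from field names and values) to record ids."""

    def __init__(self):
        self.records = []                 # record id -> original dict
        self.postings = defaultdict(set)  # token -> set of record ids

    def add(self, record: dict) -> int:
        """Index a record of any shape; no schema is declared in advance."""
        rid = len(self.records)
        self.records.append(record)
        for field, value in record.items():
            # Index field names too, so a search can find records by schema.
            for token in TOKEN.findall(f"{field} {value}".lower()):
                self.postings[token].add(rid)
        return rid

    def search(self, query: str) -> list[dict]:
        """Return records containing every query token (AND semantics)."""
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.records[i] for i in sorted(ids)]


index = TinyIndex()
# Three records with three different "schemas" in one index.
index.add({"batch_id": "B-1042", "yield_pct": 87.5, "site": "Dublin"})
index.add({"ticket": 77, "body": "Customer reports low yield from batch B-1042"})
index.add({"notebook": "NB-9", "text": "Calibration run, Dublin lab"})

print(index.search("B-1042"))     # finds the manufacturing record and the ticket
print(index.search("yield_pct"))  # finds records by field name
```

The same lookup serves two of the uses listed above: finding data sets by their schema (a field-name query) and extracting a sub-set of records for further processing (a content query).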
The State of Data Lake Adoption
Radiant Advisors and Unisphere Research recently released "The Definitive Guide to the Data Lake," a joint research project with the goal of clarifying the emerging data lake concept.
Two of the high-level findings from the research were:
- Data lakes are increasingly recognized as both a viable and compelling component within a data strategy, with small and large companies continuing to adopt.
- Governance and security are still top-of-mind as key challenges and success factors for the data lake. For a deep-dive into data lake security and governance, read my next post.
More and more research on data lakes is becoming available as companies are taking the leap to incorporate data lakes into their overall data management strategy. It is expected that, within the next few years, data lakes will be common and will continue to mature and evolve.
Using Data Lakes in Biotech and Health Research – Two Enterprise Data Lake Examples
We are currently working with two world-wide biotechnology / health research firms. There are many different departments within these organizations and employees have access to many different content sources from different business systems stored all over the world. The data includes:
- Manufacturing data (batch tests, batch yields, manufacturing line sensor data, HVAC and building systems data);
- Research data (electronic notebooks, research runs, test results, equipment data);
- Customer support data (tickets, responses); and
- Public data sets (chemical structures, drug databases, MESH headings, proteins).
Our projects focus on making structured and unstructured data searchable from a central data lake. The goal is to provide data access to business users in near real-time and improve visibility into the manufacturing and research processes. The enterprise data lake and big data architectures are built on Cloudera, which collects and processes all the raw data in one place, and then indexes that data into Cloudera Search, Impala, and HBase for a unified search and analytics experience for end-users.
Read about how we helped a pharmaceutical customer ingest over 1 Petabyte of unstructured data into their data lake.
Multiple user interfaces (UIs) are being created to meet the needs of the various user communities. Some will be fairly simple search UIs, and others will be more sophisticated, allowing for more advanced searches to be performed. Some UIs will integrate with highly specialized data analytics tools (e.g. genomic and clinical analytics). Security requirements will be respected across UIs.
Being able to search and analyze their data more effectively will lead to improvements in areas such as:
- Drug production trends – Looking for trends or drift in batches of drugs or raw materials that could indicate future problems (instrument calibration, raw materials quality, etc.) and should be addressed.
- Drug production comparisons – Comparing drug production and yields across production runs, production lines, production sites, or between research and production. Knowing historical data from different locations can increase the size and quality of a yield. Such improvements to yields have a very high return on investment.
- Traceability – The data lake gives users the ability to analyze all of the materials and processes (including quality assurance) throughout the manufacturing process. Bio-pharma is a heavily regulated industry, so security and adherence to industry-standard practices for experiments are critical requirements.
- Factors which contribute to yield – The data lake can help users take a deeper look at the end product quantity based on the material and processes used in the manufacturing process. For example, they can analyze how much product is produced when raw material, labor, and site characteristics are taken into account. This helps them make data-based decisions on how to improve yield by better controlling these characteristics (or how to save money if such controls don’t result in an appreciable increase in yield).
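As a rough illustration of the “trends or drift” item above, the following sketch flags batches whose recent average yield leaves a band around an early baseline. It is a simple control-chart style check chosen for illustration, not the method used in the projects described here. The function name, the parameters, and the yield figures are all made up.

```python
from statistics import mean, stdev


def drift_points(yields: list[float], baseline: int = 5, window: int = 3,
                 threshold: float = 2.0) -> list[int]:
    """Return indexes where the trailing-window mean leaves the baseline band.

    The band is the baseline mean +/- threshold * baseline standard deviation.
    """
    base = yields[:baseline]
    mu, sigma = mean(base), stdev(base)
    flagged = []
    # Only consider windows that lie entirely after the baseline batches.
    for end in range(baseline + window, len(yields) + 1):
        recent = mean(yields[end - window:end])
        if abs(recent - mu) > threshold * sigma:
            flagged.append(end - 1)  # index of the newest batch in the window
    return flagged


# Made-up batch yields (percent): stable at first, then drifting downward.
batch_yields = [90.1, 89.8, 90.3, 90.0, 89.9, 90.2, 89.7, 89.5, 88.4, 87.9]
print(drift_points(batch_yields))
```

With search in front of the data lake, the input series would come from a query (for example, all batches for one product at one site), so the same check can be rerun across production lines or sites for comparison.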
A Data Lake Architecture
All content will be ingested into the data lake or staging repository (based on Cloudera) and then searched (using a search engine such as Cloudera Search or Elasticsearch). Where necessary, content will be analyzed and the results fed back to users through search, across a multitude of UIs on various platforms. The diagram below shows an optimized data lake architecture that supports data lake analytics and search.
What’s on the Horizon?
At this point, the enterprise data lake is a relatively immature collection of technologies, frameworks, and aspirational goals. Future development will be focused on detangling this jungle into something which can be smoothly integrated with the rest of the business.
The future characteristics of a successful enterprise data lake will include:
- Common, well-understood methods and APIs for ingesting content
  - Make it easy for external systems to push content into the EDL
  - Provide frameworks to easily configure and test connectors to pull content into the EDL
- Corporate-wide schema management
  - Methods for identifying and tracking metadata fields through business systems
    - So we can track that “EID” is equal to “EMPLOYEE_ID” is equal to “CSV_EMP_ID” and can be reliably correlated across multiple business systems
- Business users’ interface for content processing
  - Format conversion, parsing, enrichment, and denormalization (all common processes which need to be applied to data sets)
- Text mining
  - Unstructured text such as e-mails, reports, problem descriptions, research notes, etc. is often very difficult to leverage for analysis.
  - We anticipate that common text mining technologies will become available to enrich and normalize these elements.
- Integration with document management
  - The purpose of ‘mining the data lake’ is to produce business insights which lead to business actions.
  - It is expected that these insights and actions will be written up and communicated through reports.
  - Therefore, a system which searches these reports as a precursor to analysis – in other words, a systematic method for checking prior research – will ultimately be incorporated into the research cycle.
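The schema management item above (“EID” is equal to “EMPLOYEE_ID” is equal to “CSV_EMP_ID”) can be sketched as a small alias registry. This is a hypothetical illustration: the `FIELD_ALIASES` table, the canonical name `employee_id`, the system names, and the sample records are assumptions, and a real implementation would also need to normalize value types (for example, strings read from CSV versus integers from a database).

```python
# Hypothetical alias registry: each source-specific field name maps to one canonical name.
FIELD_ALIASES = {
    "EID": "employee_id",
    "EMPLOYEE_ID": "employee_id",
    "CSV_EMP_ID": "employee_id",
}


def canonicalize(record: dict) -> dict:
    """Rename known aliases to their canonical field; leave other fields untouched."""
    return {FIELD_ALIASES.get(name, name): value for name, value in record.items()}


def correlate(sources: dict[str, list[dict]], key: str) -> dict:
    """Group records from all source systems by the canonical key field.

    Simplification: if a system holds several records for one key, the last one wins.
    """
    joined = {}
    for system, records in sources.items():
        for record in records:
            row = canonicalize(record)
            if key in row:
                joined.setdefault(row[key], {})[system] = row
    return joined


# Made-up records from three business systems that name the same field differently.
sources = {
    "hr": [{"EID": 17, "name": "A. Jones"}],
    "payroll": [{"EMPLOYEE_ID": 17, "grade": "L4"}],
    "export": [{"CSV_EMP_ID": 17, "site": "Dublin"}, {"CSV_EMP_ID": 18, "site": "Boston"}],
}

by_employee = correlate(sources, "employee_id")
print(by_employee[17])
```

Keeping the alias table as data rather than code is the design choice that matters here: it can be managed centrally and reused by every ingestion connector and search UI.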
We really are at the start of a long and exciting journey! We envision a platform where teams of scientists and data miners can collaboratively work with the corporation’s data to analyze and improve the business. After all, “information is power” and corporations are just now looking seriously at using data lakes to combine and leverage all of their information sources to optimize their business operations and aggressively go after markets.