The Fundamentals of Hadoop and MapReduce
It is increasingly important for programmers to be familiar with basic Big Data techniques, and for business folks to understand what Hadoop does and does not do.
THE BIG DATA PROBLEM
We live in the age of big data. It is not easy to measure the total volume of data stored electronically, but a recent analyst estimate puts the size of the “The Digital Universe” at 4.4 zettabytes in 2013. The same firm is forecasting a tenfold growth by 2020 to 44 zettabytes - which in more familiar terms is 44 billion terabytes. This flood of data is coming from a myriad of sources. For example:
- The New York Stock Exchange generates about four terabytes of new trade data per day.
- Facebook hosts approximately 240 billion pictures, growing seven petabytes per month.
- Ancestry.com, the genealogy site, stores around 10 petabytes of data.
- The Internet Archive stores around 18.5 petabytes of data.
- The Large Hadron Collider processes one petabyte of data every day - the equivalent of around 210,000 DVDs.
ACCESS TIME IS A KEY ISSUE
The problem is that although the storage capacities of hard drives have increased massively over the years, access speeds - the rates at which data can be read from drives - have not kept up. A typical drive from 1990 could store 1.4 MB of data, and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Today, more 20 years later, one terabyte drives are commonplace, but with typical transfer speeds of around 100 MB/s, it takes more than two and a half hours to read all of the data.
The obvious way to reduce access times is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could then read that terabyte of data in less than two minutes.
Using only one hundredth of a disk seems wasteful. However, we can of course store many datasets on the disk array. We can imagine that the users of such a system would be happy to share access in return for shorter read-times, and statistically, that read requests will likely be spread over time, so they won't interfere with each other too much. This is what big data storage systems such as HDFS (the Hadoop Distributed File System) provide.
But there is more to Big Data than simply being able to access it conveniently.
Hadoop MapReduce is a software framework for creating applications that process large datasets in parallel, on clusters of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job splits ("maps") the input dataset into independent chunks, which are processed on different nodes in the cluster, in a parallel manner. The framework then sorts the outputs of each chunk, which are merged ("reduced") back together. Usually, both the input and the output of the job are stored in a distributed filesystem.
Typically the compute nodes doing the processing and the distributed storage nodes are the same commodity servers. In other words, the MapReduce framework and the Hadoop Distributed File System (HDFS) are running on the same cluster of servers. This configuration enables the framework to efficiently schedule tasks on the nodes where data is already present, resulting in high aggregate transfer bandwidth across the cluster.
That, in a nutshell, is what Hadoop provides; a reliable, scalable platform for the storage and analysis of large datasets. Further, rather than relying on hardware robustness to deliver high-availability, the system is designed to detect and handle hardware failures, thus delivering a highly-available service even when using cheap, commodity servers.
A FRAMEWORK FOR MANY APPLICATIONS
The use of Hadoop is most strongly associated with data analysis, and applications that provide insight to support decisions. However, any application that requires the storage and processing of large data sets can benefit from Hadoop. For example, at Search Technologies we are finding Hadoop to be a valueable tool for creating highly accurate search applications.
ENGAGING WITH HADOOP
The Hadoop framework is implemented in Java, and MapReduce applications can be developed in Java or any JVM-based language. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. What’s more, Hadoop is affordable since it runs on commodity hardware, and is an open source framework. Commercial packaging and support for production systems is available from companies such as Cloudera.
If you know Java, try our free-to-download Hadoop Tutorial, which will help you to create and run your first MapReduce job. If you have an application in mind and need some technical expertise, feel free to contact us for an informal discussion. Search Technologies provides expert implementation services for Hadoop projects.