Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.
Formally known as Apache Hadoop, the technology is developed as part of an open source project within the Apache Software Foundation (ASF). Commercial distributions of Hadoop are currently offered by four primary vendors of big data platforms: Amazon Web Services (AWS), Cloudera, Hortonworks and MapR Technologies. In addition, Google, Microsoft and other vendors offer cloud-based managed services that are built on top of Hadoop and related technologies.
Hadoop and big data
Hadoop runs on clusters of commodity servers and can scale up to support thousands of hardware nodes and massive amounts of data. It uses a namesake distributed file system that’s designed to provide rapid data access across the nodes in a cluster, plus fault-tolerant capabilities so applications can continue to run if individual nodes fail. Consequently, Hadoop became a foundational data management platform for big data analytics uses after it emerged in the mid-2000s.
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to support processing in the Nutch open source search engine and web crawler. After Google published technical papers detailing its Google File System (GFS) and MapReduce programming framework in 2003 and 2004, respectively, Cutting and Cafarella modified earlier technology plans and developed a Java-based MapReduce implementation and a file system modeled on Google’s.
In early 2006, those elements were split off from Nutch and became a separate Apache subproject, which Cutting named Hadoop after his son’s stuffed elephant. At the same time, Cutting was hired by internet services company Yahoo, which became the first production user of Hadoop later in 2006. (Cafarella, then a graduate student, went on to become a university professor.)
Use of the framework grew over the next few years, and three independent Hadoop vendors were founded: Cloudera in 2008, MapR a year later and Hortonworks as a Yahoo spinoff in 2011. In addition, AWS launched a Hadoop cloud service called Elastic MapReduce in 2009. That was all before Apache released Hadoop 1.0.0, which became available in December 2011 after a succession of 0.x releases.
Components of Hadoop
The core components in the first iteration of Hadoop were MapReduce, the Hadoop Distributed File System (HDFS) and Hadoop Common, a set of shared utilities and libraries. As its name indicates, MapReduce uses map and reduce functions to split processing jobs into multiple tasks that run at the cluster nodes where data is stored and then to combine what the tasks produce into a coherent set of results. MapReduce initially functioned as both Hadoop’s processing engine and cluster resource manager, which tied HDFS directly to it and limited users to running MapReduce batch applications.
That changed in Hadoop 2.0, which became generally available in October 2013 when version 2.2.0 was released. It introduced Apache Hadoop YARN, a new cluster resource management and job scheduling technology that took over those functions from MapReduce. YARN — short for Yet Another Resource Negotiator but typically referred to by the acronym alone — ended the strict reliance on MapReduce and opened up Hadoop to other processing engines and various applications besides batch jobs.
The Hadoop 2.0 series of releases also added high availability (HA) and federation features for HDFS, support for running Hadoop clusters on Microsoft Windows servers and other capabilities designed to expand the distributed processing framework’s versatility for big data management and analytics.
Hadoop 3.0.0 was the next major version of Hadoop. Released by Apache in December 2017, it didn’t expand Hadoop’s set of core components. However, it added a YARN Federation feature designed to enable YARN to support tens of thousands of nodes or more in a single cluster, up from a previous 10,000-node limit. The new version also included support for GPUs and erasure coding, an alternative to data replication that requires significantly less storage space.
Hadoop applications and use cases
Hadoop is primarily geared to analytics uses, and its ability to process and store different types of data makes it a particularly good fit for big data analytics applications. Big data environments typically involve not only large amounts of data, but also various kinds, from structured transaction data to semistructured and unstructured forms of information, such as internet clickstream records, web server and mobile application logs, social media posts, customer emails and sensor data from the internet of things (IoT).
A common use case for Hadoop-based big data systems is customer analytics. Examples include efforts to predict customer churn, analyze clickstream data to better target online ads to web users and track customer sentiment based on comments about a company on social networks. Insurers use Hadoop for applications such as analyzing policy pricing and managing safe driver discount programs. Healthcare organizations look for ways to improve treatments and patient outcomes with Hadoop’s aid.
YARN greatly expanded the applications that Hadoop clusters can handle to include stream processing and real-time analytics applications run in tandem with processing engines, like Apache Spark and Apache Flink. For example, some manufacturers are using real-time data that’s streaming into Hadoop in predictive maintenance applications to try to detect equipment failures before they occur. Fraud detection, website personalization and customer experience scoring are other real-time use cases.
Because Hadoop can process and store such a wide assortment of data, it enables organizations to set up data lakes as expansive reservoirs for incoming streams of information. In a Hadoop data lake, raw data is often stored as is so data scientists and other analysts can access the full data sets if need be; the data is then filtered and prepared by analytics or IT teams as needed to support other applications.
Data lakes generally serve different purposes than traditional data warehouses that hold cleansed sets of transaction data. But, in some cases, companies view their Hadoop data lakes as modern-day data warehouses. Either way, the growing role of big data analytics in business decision-making has made effective data governance and data security processes a priority in data lake deployments.
Evolution of the Hadoop market
In addition to AWS, Cloudera, Hortonworks and MapR, several other IT vendors — most notably, IBM, Intel and Pivotal (a Dell Technologies subsidiary) — entered the Hadoop distribution market. However, those three companies all later dropped out and aligned themselves with one of the remaining vendors after failing to make much headway with Hadoop users. Intel dropped its distribution and invested in Cloudera in 2014, while Pivotal and IBM agreed to resell the Hortonworks version in 2016 and 2017, respectively.
Even the remaining vendors have hedged their bets on Hadoop itself by expanding their big data platforms to also include Spark and numerous other technologies. Spark, which runs both batch and real-time workloads, has ousted MapReduce in many batch applications and can bypass HDFS to access data from Amazon Simple Storage Service (S3) in the AWS cloud — a capability supported by Cloudera and Hortonworks, as well as AWS itself. In 2017, both Cloudera and Hortonworks dropped the word Hadoop from the names of their rival conferences for big data users.
Nonetheless, the overall big data ecosystem — or the Hadoop ecosystem, as it’s also still known — continues to attract the attention of users and vendors alike. And, increasingly, the focus is on cloud deployments. To compete with Amazon EMR, as Elastic MapReduce is now called, Cloudera, Hortonworks and MapR have all taken steps to make it easier to deploy and manage their platforms in the cloud, including support for transient clusters that can be shut down when no longer needed.
Organizations looking to use Hadoop in the cloud can also turn to a variety of managed services, including Microsoft’s Azure HDInsight, which is based on the Hortonworks platform, and Google Cloud Dataproc, which is built around the open source versions of both Hadoop and Spark.
Big data tools associated with Hadoop
The ecosystem that has been built up around Hadoop includes a range of other open source technologies that can complement and extend its basic capabilities. The list of related big data tools includes:
- Apache Flume: a tool used to collect, aggregate and move huge amounts of streaming data into HDFS;
- Apache HBase: a distributed database that is often paired with Hadoop;
- Apache Hive: an SQL-on-Hadoop tool that provides data summarization, query and analysis;
- Apache Oozie: a server-based workflow scheduling system to manage Hadoop jobs;
- Apache Phoenix: an SQL-based massively parallel processing (MPP) database engine that uses HBase as its data store;
- Apache Pig: a high-level platform for creating programs that run on Hadoop clusters;
- Apache Sqoop: a tool to help transfer bulk data between Hadoop and structured data stores, such as relational databases; and
- Apache ZooKeeper: a configuration, synchronization and naming registry service for large distributed systems.