Hadoop

Back to glossary

What is Hadoop?

Apache Hadoop is an open source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process structured and unstructured data and scale up reliably from a single server to thousands of machines. Apache Hadoop Logo

What is the history of Hadoop?

Apache Hadoop was born out of a need to process growing large volumes of big data and deliver web results faster as search engine start-ups like Yahoo and Google were getting off the ground. Inspired by Google’s MapReduce, a programming model that divides an application into small fractions to run on different nodes, Doug Cutting and Mike Cafarella started Hadoop in 2002 while working on the Apache Nutch project. According to a New York Times article, Doug named Hadoop after his son's toy elephant. A few years later, Hadoop was spun off from Nutch. Nutch focused on the web crawler element, and Hadoop became the distributed computing and processing portion. Two years after Cutting joined Yahoo, Yahoo released Hadoop as an open source project in 2008. The Apache Software Foundation (ASF) made Hadoop available to the public in November 2012 as Apache Hadoop.

What is the impact of Hadoop?

Hadoop was a major development in the big data space. In fact, it is credited with being the foundation for the modern cloud data lake. Hadoop democratized computing power and made it possible for companies to analyze and query big data sets in a scalable manner using free, open source software and inexpensive, off-the-shelf hardware. This was a significant development because it offered a viable alternative to the proprietary data warehouse (DW) solutions and closed data formats that had ruled the day until then. With the introduction of Hadoop, organizations quickly had access to the ability to store and process huge amounts of data, increased computing power, fault tolerance, flexibility in data management, lower costs compared to DW’s, and greater scalability -  just keep on adding more nodes. Ultimately, Hadoop paved the way for future developments in big data analytics, like the introduction of Apache Spark™.

What is the Hadoop ecosystem?

The term Hadoop is a general term that may refer to any of the following:
  • The overall Hadoop ecosystem, which encompasses both the core modules and related sub-modules.
  • The core Hadoop modules, including Hadoop Distributed File System (HDFS™), Yet Another Resource Negotiator (YARN), MapReduce, and Hadoop Common (discussed below). These are the basic building blocks of a typical Hadoop deployment.
  • Hadoop-related sub-modules, including: Apache Hive™, Apache Impala™, Apache Pig™,  and Apache Zookeeper™, among others. These related pieces of software can be used to customize, improve upon, or extend the functionality of core Hadoop.

What are the core Hadoop modules?

Components of Hadoop Core: HDFS, YARN, MapReduce, Hadoop Common
  • HDFS — Hadoop Distributed File System. HDFS is a Java-based system that allows large data sets to be stored across nodes in a cluster in a fault-tolerant manner.
  • YARN — Yet Another Resource Negotiator. YARN is used for cluster resource management, planning tasks, and scheduling jobs that are running on Hadoop.
  • MapReduceMapReduce is both a programming model and big data processing engine used for the parallel processing of large data sets. Originally, MapReduce was the only execution engine available in Hadoop, but later on, Hadoop added support for others, including Apache Tez™ and Apache Spark™.
  • Hadoop Common — Hadoop Common provides a set of services across libraries and utilities to support the other Hadoop modules.

What are some examples of popular Hadoop-related software?

Other popular packages that are not strictly a part of the core Hadoop modules but that are frequently used in conjunction with them include:
  • Apache Hive is data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL.
  • Apache Impala is the open source, native analytic database for Apache Hadoop.
  • Apache Pig is a tool that is generally used with Hadoop as an abstraction over MapReduce to analyze large sets of data represented as data flows. Pig enables operations like join, filter, sort, load, etc.
  • Apache Zookeeper is a centralized service for enabling highly reliable distributed processing.
  • Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
Read more about the Hadoop ecosystem.

What are the benefits of Hadoop?

  • Scalability — Unlike traditional systems that limit data storage, Hadoop is scalable as it operates in a distributed environment. This allowed data architects to build early data lakes on Hadoop. Learn more about the history and evolution of data lakes.
  • Resilience — The Hadoop Distributed File System (HDFS) is fundamentally resilient. Data stored on any node of a Hadoop cluster is also replicated on other nodes of the cluster to prepare for the possibility of hardware or software failures. This intentionally redundant design ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster.
  • Flexibility — unlike traditional relational database management systems, when working with Hadoop, you can store data in any format, including semi-structured or unstructured formats. Hadoop enables businesses to easily access new data sources and tap into different types of data.

What are the challenges with Hadoop architectures?

  • Complexity — Hadoop is a low-level, Java-based framework that can be overly complex and difficult for end-users to work with. Hadoop architectures can also require significant expertise and resources to set up, maintain, and upgrade.
  • Performance — Hadoop uses frequent reads and writes to disk to perform computations, which is time-consuming and inefficient compared to frameworks that aim to store and process data in memory as much as possible, like Apache Spark™.
  • Long-term viability — In 2019, the world saw a massive unraveling within the Hadoop sphere. Google, whose seminal 2004 paper on MapReduce underpinned the creation of Apache Hadoop, stopped using MapReduce altogether, as tweeted by Google SVP of Technical Infrastructure, Urs Hölzle. There were also some very high-profile mergers and acquisitions in the world of Hadoop. Furthermore, in 2020, a leading Hadoop provider shifted its product set away from being Hadoop-centric, as Hadoop is now thought of as “more of a philosophy than a technology.”  Lastly, 2021 has been a year of interesting changes. In April 2021, the Apache Software Foundation announced the retirement of ten projects from the Hadoop ecosystem. Then in June 2021, Cloudera agrees to private. The impact of this decision on Hadoop users is still to be seen. This growing collection of concerns paired with the accelerated need to digitize has encouraged many companies to re-evaluate their relationship with Hadoop.
Read more about challenges with Hadoop, and the shift toward modern data platforms, in the blog entitled It’s Time to Re-evaluate Your Relationship with Hadoop.

Additional Resources


Back to glossary