What is Hadoop?

Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process both structured and unstructured data, and it can scale reliably from a single server to thousands of machines.
How Did Hadoop Evolve?

Apache Hadoop was born out of a need to process escalating volumes of big data. Inspired by Google's MapReduce, a programming model that divides an application into small fractions that run on different nodes, Doug Cutting and Mike Cafarella started Hadoop in 2002 while they were working on the Apache Nutch project. According to a New York Times article, Doug Cutting named Hadoop after his son's toy elephant. A few years later, Hadoop was spun off from Nutch, and Yahoo released Hadoop as an open-source project in 2008. The Apache Software Foundation made Hadoop available to the public in November 2012.
Why was the introduction of Hadoop an important milestone?

Hadoop made it possible for companies to analyze and query big data sets in a scalable manner using free, open-source software and inexpensive, off-the-shelf hardware. This was a significant development, because it offered a viable alternative to the proprietary data warehouse solutions and closed data formats that had ruled the day until then. In doing so, Hadoop paved the way for future developments in big data analytics, like the introduction of Apache Spark.
What Makes Up The Hadoop Ecosystem?

The term Hadoop is used broadly and may refer to any of the following:
- The overall Hadoop ecosystem, which encompasses both the core modules and related sub-modules.
- The core Hadoop modules, including HDFS™, YARN, MapReduce, and Hadoop Common (discussed below). These are the basic building blocks of a typical Hadoop deployment.
- Hadoop related sub-modules, like Apache Hive™, Impala™, Pig™, and Zookeeper™. These related pieces of software can be used to customize, improve upon, or extend the functionality of core Hadoop.
What Are the Core Hadoop Modules?
- HDFS — Hadoop Distributed File System. HDFS allows large data sets to be distributed across nodes in a cluster in a fault-tolerant manner.
- YARN — YARN stands for Yet Another Resource Negotiator. It handles cluster resource management and the scheduling of jobs and tasks.
- MapReduce — MapReduce is a programming model and big data processing engine, used for parallel processing of large data sets. Originally, MapReduce was the only execution engine available in Hadoop, but later on Hadoop added support for others, including Apache Tez™ and Apache Spark™.
- Hadoop Common — Hadoop Common provides a set of services to support the other modules.
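To make the MapReduce model above concrete, here is a minimal sketch of its map and reduce phases in plain Java, with no Hadoop dependency. The class and method names are illustrative only; a real Hadoop job would instead implement the framework's Mapper and Reducer classes and run on the cluster.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Word count, the canonical MapReduce example.
    // "Map" phase: split each input line into (word, 1) pairs.
    // "Reduce" phase: sum the counts for each word (Integer::sum merges).
    public static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
            .filter(w -> !w.isEmpty())
            .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick brown fox", "the lazy dog")));
    }
}
```

In a real deployment, the map phase runs in parallel on the nodes holding each split of the input, and the framework shuffles intermediate pairs by key before the reduce phase, which is what lets the same logic scale across a cluster.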
What Are Some Examples of Popular Hadoop-Related Software?

Other popular packages that are not strictly a part of the core Hadoop modules, but that are frequently used in conjunction with them, include:
- Apache Hive — data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL.
- Apache Impala — a massively parallel processing (MPP) SQL query engine that provides low-latency, interactive queries over data stored in Hadoop.
- Apache Pig — a platform for analyzing large data sets using a high-level scripting language called Pig Latin, whose scripts are compiled into jobs that run on Hadoop.
- Apache Zookeeper — a centralized service for maintaining configuration information, naming, and distributed coordination across a cluster.
Advantages of Hadoop
- Scalability — Unlike traditional systems that cap how much data a single machine can hold, Hadoop is scalable because it operates in a distributed environment: adding nodes adds both storage and compute capacity. This allowed data architects to build early data lakes on Hadoop. Learn more about the history and evolution of data lakes.
- Resilience — The Hadoop Distributed File System (HDFS) is fundamentally resilient. Data stored on any node of a Hadoop cluster is also replicated on other nodes of the cluster to prepare for the possibility of hardware or software failures. This intentionally redundant design ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster.
- Flexibility — Unlike traditional relational database management systems, Hadoop lets you store data in any format, including semi-structured or unstructured formats. Hadoop enables businesses to easily access new data sources and tap into different types of data.
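The replication behavior described under Resilience can be modeled in a few lines of plain Java. This is an illustrative sketch only (the names are hypothetical, and real HDFS block placement is rack-aware and considerably more sophisticated); it assumes HDFS's default replication factor of 3.

```java
import java.util.*;

public class ReplicationSketch {
    // Assign each block to `replicas` distinct nodes, round-robin.
    // Loosely mimics HDFS keeping multiple copies of every block so
    // that the loss of any single node leaves the data intact.
    public static Map<Integer, List<String>> placeBlocks(
            int numBlocks, List<String> nodes, int replicas) {
        Map<Integer, List<String>> placement = new HashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> targets = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                targets.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(b, targets);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Four blocks spread over four nodes, three copies each:
        // every block survives the failure of any one node.
        System.out.println(placeBlocks(4, List.of("node1", "node2", "node3", "node4"), 3));
    }
}
```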
Challenges with Hadoop Architectures
- Complexity — Hadoop is a low-level, Java-based framework that can be overly complex and difficult for end users to work with. Hadoop architectures can also require significant expertise and resources to set up, maintain, and upgrade.
- Performance — Hadoop uses frequent reads and writes to disk to perform computations, which is time consuming and inefficient compared to frameworks that aim to store and process data in memory as much as possible, like Apache Spark.
- Long-term viability — Google, whose seminal 2004 paper on MapReduce underpinned the open source development of Apache Hadoop, has stopped using MapReduce altogether, as tweeted by Google SVP Urs Hölzle: “… R.I.P. MapReduce. After having served us well since 2003, today we removed the remaining internal codebase for good…” In light of this, and the availability of newer technologies, many companies are reevaluating their Hadoop investments to see whether the technology still meets their needs.