What Is Apache Hadoop? - Databricks


Glossary Item
« Back to Glossary Index
Source Databricks

Apache Hadoop is an open-source, Java-based, software platform that manages data processing and storage for big data applications. The data is stored on commodity servers that run as clusters. It can provide a quick and reliable analysis of both structured data and unstructured data. It can scale up from single servers to thousands of machines, each offering local computation and storage. Shortly, Hadoop allows multiple concurrent tasks to run from single to thousands of servers without any delay.

How Did Hadoop Evolve?

Apache Hadoop was born out of a need to process escalating volumes of big data. Inspired by Google’s MapReduce that divides an application into small fractions to run on different nodes, Hadoop was started by Doug Cutting and Mike Cafarella in the year 2002 when they both started to work on the Apache Nutch project. A few years later Hadoop was later spun-off from that.  As a result, Yahoo released Hadoop as an open-source project in 2008. Hadoop was made available for the public in November 2012 by Apache Software Foundation.

What Comprises of Hadoop Ecosystem?

It consists of the following elements:

  • Hadoop Distributed File System (HDFS), which provides high data availability
  • Structure of Hadoop YARN, used for planning tasks and management of clusters
  • Implementation of MapReduce, based on YARN structure, for parallel processing on large datasets
  • Hadoop Common, a set of services to support the other modules.

Hadoop’s Core Components

Hadoop's Core Components

Hadoop solves two key challenges that traditional databases were struggling with:

1.  Capacity: Hadoop stores large volumes of data.

Hadoop is capable of storing and processing truly massive data sets. Furthermore, the platform is capable of managing all types of data, especially unstructured ones.

2.  Speed: Hadoop stores and retrieves data quickly.
Hadoop uses the MapReduce functional programming model to perform parallel processing across data sets. If you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.

Advantages of Using Hadoop:

Hadoop offers numerous benefits to organizations that are data-driven. Here are some of them:

  • Scalability and Performance – Unlike traditional systems that have a limitation on data storage, Hadoop is scalable as it operates in a distributed environment. The flexibility and ease of scalability provided by Hadoop allow organizations to completely leverage all of their data.
  • Resilience – Hadoop is fundamentally resilient; data stored in any node is also replicated in other nodes of the cluster in preparation for future node failures. This ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster.
  • Flexibility – unlike traditional relational database management systems, when working with Hadoop you can store data in any format, including semi-structured or unstructured formats. Hadoop enables businesses to easily access new data sources and tap into different types of data.
  • Low Cost – Hadoop is an open source platform and as no license needs to be purchased, the costs are significantly lower compared to proprietary software.



« Back to Glossary Index