Hadoop Cluster

A Hadoop cluster is a collection of computers networked together to work as a single system that stores and analyzes big data (structured, semi-structured, and unstructured) in a distributed computing environment. The machines in the cluster run Hadoop's open-source distributed processing software to accomplish this.
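To make the client's view of such a cluster concrete, here is a minimal sketch in Java of writing and then reading a file in HDFS. It assumes the Hadoop client libraries are on the classpath; the NameNode address hdfs://namenode:9000 and the path /demo/hello.txt are hypothetical placeholders.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt"); // placeholder path

        // Write a small file; HDFS splits it into blocks and replicates
        // them across DataNodes behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hadoop cluster".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and echo the contents to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```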

Hadoop clusters are built from many commodity machines networked together. These machines communicate with a higher-end machine that takes the master role; together, the master and worker (slave) nodes implement distributed computing over distributed data storage.

A Hadoop cluster is scaled out by adding more nodes. Scaling is roughly linear: each added node contributes a corresponding increase in storage capacity and processing throughput.
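One way to observe this scaling is that the NameNode tracks every DataNode in the cluster, and a client can ask it for the current roster. The sketch below, using the Java HDFS client API and the same hypothetical NameNode address as above, lists each DataNode with its free and total capacity; a newly added node appears here once it registers with the NameNode.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        // The hdfs:// scheme gives us a DistributedFileSystem instance.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Ask the NameNode for the current set of DataNodes and their capacity.
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            long gb = 1024L * 1024 * 1024;
            System.out.printf("%s: %d GB free of %d GB%n",
                    node.getHostName(),
                    node.getRemaining() / gb,
                    node.getCapacity() / gb);
        }
        dfs.close();
    }
}
```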

Hadoop Cluster Architecture

A Hadoop cluster works on a master/slave model. Physically, the architecture is organized into data centers, racks within those data centers, and individual nodes, which are the machines that actually execute the jobs.

Hadoop clusters consist of three different node types: master nodes, worker nodes, and client nodes.

  • Master nodes oversee storing data in HDFS and coordinate key operations, such as running parallel computations on that data using MapReduce. A master node runs three services: the NameNode, the Secondary NameNode, and the JobTracker (replaced by the ResourceManager in YARN-based clusters).
  • Worker nodes (slave nodes) make up the majority of machines in the cluster and do the work of storing the data and running the computations. Each worker node runs the DataNode and TaskTracker services (the latter replaced by the NodeManager under YARN), which receive instructions from the master nodes.
  • Client nodes are in charge of loading data into the cluster. Client nodes first submit MapReduce jobs describing how the data should be processed, then fetch the results once the processing is finished; the sketch after this list shows what such a job looks like.
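As an illustration of the kind of job a client node submits, here is the classic WordCount example written against Hadoop's MapReduce API. The input and output paths are placeholders passed on the command line; in practice the class is packaged into a JAR and submitted from a client node.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on worker nodes, one task per input split,
    // emitting (word, 1) for every token it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts emitted for each distinct word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            ctx.write(key, result);
        }
    }

    // Driver: what the client node actually submits to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation on each worker
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```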

Advantages of a Hadoop Cluster

  • As big data grows exponentially, the parallel processing power of a Hadoop cluster helps boost the speed of the analysis process.
  • Hadoop clusters are inexpensive to set up, since they are built from cheap commodity hardware.
  • Hadoop clusters are resilient to failure: data is replicated across nodes, so processing can continue when an individual node goes down.
  • They are easily scalable.
  • Hadoop clusters can quickly ingest and process data from various sources and in different formats.
  • It is possible to deploy Hadoop as a single-node installation for evaluation purposes; a minimal local-mode sketch follows this list.
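On the last point, Hadoop can run in a local (standalone) mode in which map and reduce tasks execute inside a single JVM against the local filesystem, which is convenient for evaluation and testing. Below is a minimal sketch, reusing the mapper and reducer classes from the WordCount example above; the "input" and "output" directories are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");          // use the local filesystem, not HDFS
        conf.set("mapreduce.framework.name", "local"); // run map/reduce tasks in this JVM

        Job job = Job.getInstance(conf, "word count (local evaluation)");
        job.setJarByClass(LocalWordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class); // classes from the sketch above
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("input"));    // placeholder local directory
        FileOutputFormat.setOutputPath(job, new Path("output")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```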
