Friday, July 19, 2013

Components of Hadoop

A running Hadoop cluster consists of a set of daemons. Some of these run on a single server, whereas others run across multiple servers. These daemons include:
  1. Namenode
  2. Secondary Namenode
  3. Datanode
  4. Jobtracker
  5. Tasktracker
As seen in the previous article, a Hadoop cluster consists of two types of operating nodes, viz. the Namenode and the Datanodes.

Namenode:

The Namenode is responsible for managing the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories on the HDFS cluster. This information is stored persistently on the Namenode's local disk in two files: the namespace image and the edit log. The Namenode also knows the Datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the Datanodes when the system starts. A client accesses the filesystem on behalf of the user by communicating with the Namenode and the Datanodes.
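
For example, a client can ask the Namenode for this block metadata through Hadoop's Java FileSystem API. The sketch below (the file path is just an illustration) prints the Datanode hosts for each block of a file:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up the Namenode address (fs.default.name) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used only for illustration
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // The Namenode answers this metadata query; the block data itself
        // would be read directly from the Datanodes
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " hosted on " + Arrays.toString(block.getHosts()));
        }
    }
}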


The Namenode is a single point of failure of the Hadoop cluster: if it is lost, every file on the filesystem is effectively lost too, since there is no way to reconstruct the files from the raw blocks on the Datanodes. It is therefore necessary to make the Namenode fault tolerant. There are two ways of doing this. The first is to configure Hadoop so that it writes the persistent state of the filesystem metadata to multiple filesystems, as sketched below. The second is to run a Secondary Namenode.
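
For the first approach, a minimal hdfs-site.xml sketch would look like the following. The paths are hypothetical; in Hadoop 1.x the property is dfs.name.dir (dfs.namenode.name.dir in later versions), it takes a comma-separated list of directories, and the Namenode writes to all of them, typically one local disk and one remote NFS mount:

<property>
  <name>dfs.name.dir</name>
  <!-- hypothetical paths: one local disk, one NFS mount -->
  <value>/data/1/dfs/nn,/remote/nfs/dfs/nn</value>
</property>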

Secondary Namenode:

The Secondary Namenode periodically merges the namespace image with the edit log and maintains a copy of the merged namespace image. It usually runs on a separate machine, since the merge needs as much memory and CPU as the Namenode itself. However, the state of the Secondary Namenode lags behind that of the primary Namenode, so if the primary Namenode fails, some data loss is almost certain.
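
How often this checkpoint happens is configurable. In Hadoop 1.x the property lives in core-site.xml and defaults to one hour; the sketch below halves that:

<property>
  <name>fs.checkpoint.period</name>
  <value>1800</value> <!-- seconds between checkpoints -->
</property>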


Datanode:

The Datanodes are the workhorses of the filesystem. They store and retrieve blocks when told to by clients or the Namenode, and they periodically report back to the Namenode with the lists of blocks that they are storing.
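
You can see this block-to-Datanode mapping for yourself by running fsck on a file (the path here is just an example); the -locations flag prints the Datanode addresses holding each block's replicas:

hadoop fsck /user/demo/sample.txt -files -blocks -locations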


All of the above daemons are called storage daemons, since they handle operations related to the storage of files on HDFS. The storage daemons follow a master-slave architecture, with the Namenode acting as the master and the Datanodes acting as slaves. Next we'll see the compute daemons. They also follow a master-slave architecture, with the Jobtracker acting as the master and the Tasktrackers acting as slaves.
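
This split between the two daemon families is visible when starting a Hadoop 1.x cluster, where each family has its own startup script in the bin directory:

start-dfs.sh     # starts the Namenode, Secondary Namenode and Datanodes
start-mapred.sh  # starts the Jobtracker and Tasktrackers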

Jobtracker: 

The Jobtracker coordinates all the jobs run on the system by scheduling their tasks to run on Tasktrackers. If a task fails, it is the Jobtracker's responsibility to reschedule it on a different Tasktracker.
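
How many times the Jobtracker retries a task before declaring the whole job failed is configurable. In Hadoop 1.x the relevant mapred-site.xml properties look like this (4 is the default for both):

<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>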

Tasktracker:

The Tasktrackers run the tasks allocated to them and send progress reports to the Jobtracker, which keeps a record of the overall progress of each job.
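
To see the Jobtracker and Tasktrackers in action, here is a minimal job driver sketch using the classic org.apache.hadoop.mapred API. The class name and command-line paths are illustrative, and since no mapper or reducer is set, Hadoop falls back to the identity classes:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MinimalJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MinimalJobDriver.class);
        conf.setJobName("minimal job");

        // Input and output locations are taken from the command line
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the job to the Jobtracker, which splits it into tasks
        // and schedules them on Tasktrackers; this call then blocks,
        // printing progress, until the job completes
        JobClient.runJob(conf);
    }
}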

The diagram below shows the topology of a Hadoop cluster:

[Diagram: master-slave architecture of a Hadoop cluster topology]

 

