Friday, July 19, 2013

Components of Hadoop

A running Hadoop cluster consists of a set of daemons. Some of these run on a single server, whereas others run across multiple servers. These daemons include:
  1. Namenode
  2. Secondary Namenode
  3. Datanode
  4. Jobtracker
  5. Tasktracker
As seen in the previous article, a Hadoop cluster consists of two types of operating nodes, viz. the Namenode and the Datanodes.

Namenode

The Namenode is responsible for managing the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories on the HDFS cluster. This information is stored on the local disk in two files: the namespace image and the edit log. The Namenode also knows the Datanodes on which all the blocks of a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the Datanodes when the system starts. A client accesses the filesystem on behalf of the user by communicating with the Namenode and the Datanodes.
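By way of illustration, here is a minimal sketch of a Java client that reads a file from HDFS using Hadoop's FileSystem API; the URI below is just a placeholder for your own cluster and file. The client first asks the Namenode for the file's metadata and block locations, then streams the blocks from the Datanodes.

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // Placeholder URI; point it at your own Namenode and file.
            String uri = "hdfs://namenode:9000/user/demo/sample.txt";
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            // The client asks the Namenode where the file's blocks live,
            // then reads the blocks directly from the Datanodes.
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }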


The Namenode is a single point of failure of the Hadoop cluster; it is therefore necessary to make the Namenode fault tolerant. There are two ways of doing this. The first is to configure Hadoop so that it writes the persistent state of the filesystem metadata to multiple filesystems (typically the local disk and a remote NFS mount). The second is to run a Secondary Namenode.
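As a sketch of the first approach: in Hadoop 1.x, the dfs.name.dir property in hdfs-site.xml can list several directories, and the Namenode writes its persistent metadata to all of them. The paths below are placeholders only:

    <property>
      <name>dfs.name.dir</name>
      <value>/data/1/dfs/nn,/remote/nfs/dfs/nn</value>
    </property>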

Secondary Namenode:

The Secondary Namenode periodically merges the namespace image with the edit log and keeps a copy of the merged namespace image. It usually runs on a separate machine, since the merge is CPU- and memory-intensive. However, the Secondary Namenode's state lags behind that of the primary Namenode, so if the primary fails, some data loss is almost certain.
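For illustration, in Hadoop 1.x the checkpointing behaviour of the Secondary Namenode is governed by properties such as fs.checkpoint.period (seconds between merges) and fs.checkpoint.dir (where the merged image is written); the values below are only examples:

    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>
    </property>
    <property>
      <name>fs.checkpoint.dir</name>
      <value>/data/1/dfs/snn</value>
    </property>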


Datanode:

The Datanodes act as the workhorses of the filesystem. They store and retrieve blocks when requested by clients or the Namenode, and they periodically report back to the Namenode with lists of the blocks they are storing.


All the above daemons are called storage daemons, since they handle operations related to the storage of files on HDFS. The storage daemons follow a master-slave architecture, with the Namenode acting as master and the Datanodes acting as slaves. Next come the compute daemons. They also follow a master-slave architecture, with the Jobtracker acting as master and the Tasktrackers acting as slaves.

Jobtracker: 

The Jobtracker coordinates all the jobs run on the system by scheduling their tasks to run on Tasktrackers. If a task fails, it is the Jobtracker's responsibility to reschedule it on a different Tasktracker.

Tasktracker:

Tasktrackers run the tasks allocated to them and send progress reports to the Jobtracker, which keeps a record of the overall progress of each job.
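By way of illustration, here is a minimal sketch of how a client hands work to the Jobtracker using Hadoop 1.x's classic org.apache.hadoop.mapred API. No mapper or reducer is set, so the defaults simply pass records through; the point here is only the submission path. Input and output paths come from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitJob.class);
            conf.setJobName("identity copy");
            FileInputFormat.setInputPaths(conf, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));   // output directory
            // runJob() submits the job to the Jobtracker, which splits it into map
            // and reduce tasks, schedules them on Tasktrackers, and reschedules
            // any task that fails; the call blocks until the job completes.
            JobClient.runJob(conf);
        }
    }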

The diagram below shows the topology of a Hadoop cluster:



[Figure: master-slave architecture of a Hadoop cluster topology]

Sunday, July 7, 2013

Comparison of Hadoop with other systems


In my previous post I explained what Hadoop is and why it is needed. In this post I'll compare Hadoop with other existing systems.

Hadoop vs RDBMS:


Disk latency has not improved in proportion to disk bandwidth, i.e. seek time has not kept pace with transfer time. An RDBMS uses B-tree indexes for data access, which are limited by seek time, so reading a large fraction of the data one record at a time takes a long time. Hadoop's MapReduce model streams through the data and is limited only by the transfer rate. Hence, for queries that touch most of the dataset, a B-tree is less efficient than MapReduce.
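To make this concrete with some rough, assumed figures: take a 1 TB dataset on disks that transfer about 100 MB/s and need about 10 ms per seek. A full sequential scan reads 1,000,000 MB at 100 MB/s, roughly 10,000 seconds, i.e. under three hours. Reading even 10% of the data through an index, one seek per record, with records of about 1 KB, means on the order of 100 million seeks; at 10 ms each that is about a million seconds, well over a week. So once a query touches more than a small fraction of the dataset, streaming through all of it is faster than seeking to individual records.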

An RDBMS is more efficient for point queries, where the data is indexed to reduce the effect of seek latency, whereas Hadoop's MapReduce is more efficient for queries that involve the entire dataset. Moreover, MapReduce suits applications in which data is written once and read many times, whereas an RDBMS is designed for datasets that are continuously updated.

                                             MapReduce                     RDBMS

  1. Size of data                            Petabytes                     Gigabytes
  2. Integrity of data                       Low                           High
  3. Data schema                             Dynamic                       Static
  4. Access method                           Batch                         Interactive and batch
  5. Scaling                                 Linear                        Nonlinear
  6. Data structure                          Unstructured                  Structured
  7. Normalization of data                   Not required                  Required
These differences are likely to blur in the near future.

Hadoop vs Grid Computing:


Grid computing has long performed large-scale processing by dividing a job over a cluster of machines, but it is efficient only for compute-intensive jobs. For data-intensive jobs, huge volumes of data have to be transferred over the network, and network bandwidth becomes the bottleneck. This is where Hadoop outperforms grid computing: MapReduce tries to run each computation on the node where its data resides, thus saving network bandwidth. This is called the principle of data locality, and it lies at the heart of MapReduce.

Moreover, MapReduce spares programmers from writing code to handle node failures and data flow, since these are handled implicitly by the framework, whereas grid computing gives programmers great control over, and responsibility for, data flow and failure handling.

Thus we can say that Hadoop is not a replacement for an RDBMS; the two systems can coexist, each serving the workloads it handles best.


Saturday, July 6, 2013

What is Hadoop

Need for Hadoop:


Big Data:

Today we are surrounded by data; in fact, it would not be wrong to say that we live in the data age. The amount of data is increasing exponentially, and as it grows it becomes more and more challenging for organisations to store and analyse it. The success of an organisation largely depends on its ability to extract valuable information from this data. Hadoop deals with this exploding data by scaling out rather than scaling up, i.e. by using more machines rather than bigger machines.

Data Storage:

The access speeds of hard drives have not increased in proportion to their storage capacities over the years. As a result, it takes hours to read an entire hard disk, and even longer to write one. This problem can be mitigated by dividing the data over multiple hard drives and reading from them in parallel.
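As a rough illustration with assumed figures: a 1 TB drive with a transfer rate of about 100 MB/s takes roughly 10,000 seconds, close to three hours, to read in full. Spread the same 1 TB across 100 drives and read them in parallel, and each drive only has to deliver about 10 GB, so the whole scan finishes in under two minutes.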

Parallel read and write operations raise new issues:
  1. Need to handle hardware failures: Hadoop has its own distributed filesystem, called HDFS, which deals with hardware failures by replicating data. We'll learn more about HDFS in upcoming posts.
  2. Ability to combine data from different drives: Most analyses need to combine data from different hard drives. Hadoop uses the MapReduce programming model, which abstracts this problem by transforming it into computations over key-value pairs. We'll learn this programming model in upcoming posts. For now, all you need to know is that there are two phases of computation, mapping and reducing, and that the mixing (shuffling) of data between drives happens at the interface between these two phases; a small sketch of this model follows the next paragraph.
Thus, in short, Hadoop provides two components, HDFS and MapReduce, which together form a reliable shared storage and analysis system.
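As a minimal sketch of this key-value model (using Hadoop 1.x's classic org.apache.hadoop.mapred API; class names and file paths are placeholders), the classic word count maps each word in the input to the pair (word, 1), the framework groups the pairs by word between the two phases, and the reducer sums the 1s for each word:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

        // Map phase: emit a (word, 1) pair for every word in the input line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    output.collect(word, ONE);
                }
            }
        }

        // Reduce phase: the framework has already grouped the pairs by word,
        // so each call receives one word together with all of its 1s; sum them.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }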

Hadoop Introduction:


Hadoop is a framework for distributed computing, designed to process big data. Some of the key features of Hadoop are:

  1. Accessibility: Hadoop runs on large clusters of commodity hardware.
  2. Robustness: Hadoop handles failures by replication of data.
  3. Scalability: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
  4. Simplicity: Hadoop allows users to write parallel programs quickly.

The image below shows how users interact with a Hadoop cluster.


[Figure: client interaction with a Hadoop cluster]

In my next post I'll show how Hadoop compares with other systems.
