Cloud Computing: Hadoop
1. Apache Hadoop
Apache Hadoop is a software framework for the distributed processing of large datasets. Hadoop provides a distributed file system (HDFS) and a MapReduce implementation.
A dedicated machine acts as the "name node". This machine stores the information about the available nodes and the files in the cluster. The Hadoop client machines are called nodes.
The "name node" is currently a single point of failure. The Hadoop project is working on solutions for this.
Apache Hadoop can be used to filter and aggregate data; a typical use case is the analysis of web server log files to find the most visited pages. MapReduce has also been used to traverse graphs and to perform other tasks.
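To make this use case concrete, the following is a minimal sketch of a mapper and reducer that count page visits. The log format (the requested page as the seventh space-separated field, as in the common Apache log format) and the class names are assumptions for illustration.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageVisitCount {

  // Emits (page, 1) for every log line.
  public static class VisitMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text page = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(" ");
      if (fields.length > 6) {   // skip malformed lines
        page.set(fields[6]);     // requested page path (assumed field position)
        context.write(page, ONE);
      }
    }
  }

  // Sums the visit counts per page.
  public static class VisitReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }
}

The mapper runs on the nodes that hold the input data, and the reducer receives all counts for one page grouped together, which is what makes the "most visited pages" aggregation straightforward.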
2. Hadoop file system (HDFS)
The Hadoop file system (HDFS) is a distributed file system. It uses an existing file system of the operating system but extends it with redundancy and distribution. HDFS hides the complexity of distributed storage and redundancy from the programmer.
In the default configuration HDFS stores every file three times, on different nodes. The "name node" (server) holds the information about where the files are stored.
Hard disks are very efficient at reading large files sequentially but much slower at random access. HDFS is therefore optimized for large files.
To improve performance, Hadoop also tries to move the computation to the nodes which store the data, and not vice versa. Especially with very large datasets this improves performance, because it avoids the network becoming the bottleneck.
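As a programmer you access HDFS through the FileSystem API, which hides the replication and distribution described above. The following is a minimal sketch of writing and reading a file; the path and the file content are made up for illustration, and the name node address is taken from the standard Hadoop configuration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads the cluster configuration (including the name node address)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/example.txt");   // hypothetical path

    // Write a file; HDFS replicates its blocks transparently
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}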
3. MapReduce
Apache Hadoop jobs work according to the MapReduce principle. See the MapReduce article for details. A minimal job driver is sketched below.
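The following sketch shows how such a job is wired together, reusing the mapper and reducer from the page-visit example above. The class names and the input and output paths are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageVisitJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "page visit count");

    job.setJarByClass(PageVisitJob.class);
    job.setMapperClass(PageVisitCount.VisitMapper.class);
    job.setCombinerClass(PageVisitCount.VisitReducer.class);
    job.setReducerClass(PageVisitCount.VisitReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // log files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // result directory

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}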
4. Installation
Apache Hadoop can be downloaded from the Hadoop homepage. To get started with Hadoop you need the following sub-projects:
- Hadoop Common
- MapReduce
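After downloading, a small sanity check is to compile and run a program against the Hadoop jars; the sketch below simply prints the version of the Hadoop libraries on the classpath (the class name is an assumption for illustration).

import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
  public static void main(String[] args) {
    // Prints the version of the Hadoop Common jars found on the classpath
    System.out.println("Hadoop version: " + VersionInfo.getVersion());
  }
}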