Cloud Computing: Hadoop
1. Apache Hadoop
Apache Hadoop is a software framework for the distributed processing of large datasets. Hadoop provides a distributed file system (HDFS) and a MapReduce implementation.
A dedicated machine acts as the "name node". This machine stores the information about the available nodes and the files in the cluster. The Hadoop client machines are called nodes.
The "name node" is currently a single point of failure. The Hadoop project is working on solutions for this.
Apache Hadoop can be used to filter and aggregate data; a typical use case is the analysis of web server log files to find the most visited pages. MapReduce has also been used to traverse graphs and to perform other tasks.
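To make this use case concrete, the following is a minimal sketch of a mapper and reducer that count page visits. The log format (the requested page as the seventh space-separated field, as in the common Apache log format) and the class names are assumptions for illustration.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageVisitCount {

  // Emits (page, 1) for every log line.
  public static class VisitMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text page = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(" ");
      if (fields.length > 6) {   // skip malformed lines
        page.set(fields[6]);     // requested page path (assumed field position)
        context.write(page, ONE);
      }
    }
  }

  // Sums the visit counts per page.
  public static class VisitReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }
}

The mapper runs on the nodes that hold the input data, and the reducer receives all counts for one page grouped together, which is what makes the "most visited pages" aggregation straightforward.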
2. Hadoop file system (HDFS)
The Hadoop file system (HDFS) is a distributed file system. It uses an existing file system of the operating system but extends it with redundancy and distribution. HDFS hides the complexity of distributed storage and redundancy from the programmer.
In the default configuration HDFS stores every file three times, on different nodes. The "name node" (server) holds the information about where the files are stored.
Hard disks are very efficient at reading large files sequentially but much slower at random access. HDFS is therefore optimized for large files.
To improve performance, Hadoop also tries to move the computation to the nodes which store the data, and not vice versa. Especially with very large datasets this improves performance, because it avoids the network becoming the bottleneck.
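As a programmer you access HDFS through the FileSystem API, which hides the replication and distribution described above. The following is a minimal sketch of writing and reading a file; the path and the file content are made up for illustration, and the name node address is taken from the standard Hadoop configuration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads the cluster configuration (including the name node address)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/example.txt");   // hypothetical path

    // Write a file; HDFS replicates its blocks transparently
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}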
3. MapReduce
Apache Hadoop jobs work according to the MapReduce principle. See the MapReduce article for details. A minimal job driver is sketched below.
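The following sketch shows how such a job is wired together, reusing the mapper and reducer from the page-visit example above. The class names and the input and output paths are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageVisitJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "page visit count");

    job.setJarByClass(PageVisitJob.class);
    job.setMapperClass(PageVisitCount.VisitMapper.class);
    job.setCombinerClass(PageVisitCount.VisitReducer.class);
    job.setReducerClass(PageVisitCount.VisitReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // log files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // result directory

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}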
4. Installation
Apache Hadoop can be downloaded from the Hadoop homepage. To get started with Hadoop you need the following sub-projects:
- Hadoop Common
- MapReduce
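After downloading, a small sanity check is to compile and run a program against the Hadoop jars; the sketch below simply prints the version of the Hadoop libraries on the classpath (the class name is an assumption for illustration).

import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
  public static void main(String[] args) {
    // Prints the version of the Hadoop Common jars found on the classpath
    System.out.println("Hadoop version: " + VersionInfo.getVersion());
  }
}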