
Showing posts from April, 2015

HDFS Architecture

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large datasets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now an Apache Hadoop subproject. HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they ...
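
To make the client/NameNode interaction concrete, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address and file path are hypothetical; the point is that the client resolves the namespace through the NameNode while the block data itself is streamed to and from DataNodes.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; the NameNode manages the
            // namespace, while DataNodes serve the actual block data.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);

            // Write a small file: the NameNode allocates blocks, and the
            // client streams the bytes to DataNodes.
            Path file = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back: the NameNode returns block locations, and the
            // client reads from the DataNodes holding the replicas.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }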

Components of Apache Hadoop 2.6.X and later

Apache Hadoop is organized into components, each with a specific role, so that the overall architecture stays loosely coupled. The components are the following. Commons: the common utilities that support the other Hadoop modules, including web support for monitoring HDFS via a proxy server and the capability to write metrics directly to Graphite. HDFS: the Hadoop Distributed File System, a highly fault-tolerant distributed file system designed to run on commodity hardware and to provide high-throughput, streaming access to large datasets (see the HDFS Architecture post above). MapReduce: The key techn ...
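
As a small illustration of how the Commons layer underpins the other components, the following sketch uses org.apache.hadoop.conf.Configuration from hadoop-common, the same mechanism through which HDFS and MapReduce read their settings. The fs.defaultFS value shown is hypothetical.

    import org.apache.hadoop.conf.Configuration;

    public class ComponentsConfigSketch {
        public static void main(String[] args) {
            // Configuration comes from Hadoop Commons; it loads
            // core-default.xml and core-site.xml from the classpath
            // when they are present.
            Configuration conf = new Configuration();

            // Hypothetical override: the filesystem URI that HDFS
            // clients connect to.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            // Both HDFS and MapReduce resolve their settings through this
            // common layer, which is what keeps the components loosely
            // coupled.
            System.out.println("Default FS: " + conf.get("fs.defaultFS"));
            System.out.println("Replication: " + conf.getInt("dfs.replication", 3));
        }
    }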

Introduction to Apache Hadoop

Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is: Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2). Robust: because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions, and it can gracefully handle most such failures. Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster. Simple: Hadoop allows users to quickly write efficient parallel code (see the word-count sketch below). Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Even college students can quickly and cheaply create their own Hadoop cluster. On the other hand, its robustness and scalability make it suitable for even the most dem ...
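
To give a taste of the "Simple" claim, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The class names and paths are illustrative; input and output directories come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would be launched with something like: hadoop jar wordcount.jar WordCount /input /output. The framework handles splitting the input, scheduling mappers and reducers across the cluster, and retrying failed tasks, which is exactly the robustness described above.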