Apache Hadoop is a open source software and used for distributed storage and distributed processing of Large Data Set. Hadoop is generally utilized to process Big Data. Hadoop can process very large data sets on computer clusters built from commodity (commercially available) hardware.
Hadoop is composed of following 4 elements:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
- Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
- Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications;and
- Hadoop MapReduce – an implementation of the MapReduce programming model for large scale data processing.
Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed. This approach takes advantage of data locality nodes manipulating the data they have access to— to allow the data set to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
Hadoop Supported File Systems:
- HDFS: Hadoop’s own rack-aware file system.This is designed to scale to tens of petabytes of storage and runs on top of the file systems of the underlying operating systems.
- FTP File system: this stores all its data on remotely accessible FTP servers.
- Amazon S3 (Simple Storage Service)file system. This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this file system, as it is all remote.
- Windows Azure Storage Blobs (WASB): WASB, an extension on top of HDFS, allows distributions of Hadoop to access data in Azure blob stores without moving the data permanently into the cluster.
|Developer(s)||Apache Software Foundation|
|Initial release||December 10, 2011|
|Stable release||2.7.2 / January 25, 2016|
|Type||Distributed file system|
|License||Apache License 2.0|
Hadoop can e utilized for :
- Log and/or click stream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
Cost of Hadoop
Hadoop is a opensource and free software. However, to make Hadoop work there is a associated cost which is not negligible. I have found a cost table for Hadoop set up perceivant.com which I liked:
|Setup ($9,000 x 4 months)||$36,000|
|Use Case Development||$200,000 – $1M|
|Hosted Hadoop Set-Up Cost:||$236,000 – $1.04M|
|Talent & Services||Monthly Cost|
36,490 total views, 3 views today