Hadoop

Apache Hadoop is open-source software for the distributed storage and distributed processing of very large data sets. Hadoop is generally used to process Big Data: it can handle very large data sets on computer clusters built from commodity (commercially available) hardware.

Hadoop is composed of the following four modules:

  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them to schedule users’ applications; and
  • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process the data, Hadoop ships packaged code to the nodes so that each node processes, in parallel, the data it stores. This approach takes advantage of data locality (nodes manipulating the data they already hold), allowing the data set to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
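To make the model concrete, here is a minimal sketch of the classic WordCount job, essentially the tutorial example that ships with Hadoop: the map function runs on the nodes holding the input blocks and emits (word, 1) pairs, and the reduce function sums the counts per word. The input and output paths are placeholders taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs on the nodes holding each input block and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives all counts for a given word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on map nodes
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this would be launched with something along the lines of hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths (placeholders here); YARN then schedules the map and reduce tasks across the cluster.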


Hadoop Supported File Systems

Hadoop works with several storage back ends, all accessed through the same pluggable FileSystem API (see the sketch after this list):

  • HDFS: Hadoop’s own rack-aware file system. It is designed to scale to tens of petabytes of storage and runs on top of the file systems of the underlying operating systems.
  • FTP file system: stores all its data on remotely accessible FTP servers.
  • Amazon S3 (Simple Storage Service) file system: targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this file system, as it is all remote.
  • Windows Azure Storage Blobs (WASB): WASB, an extension on top of HDFS, allows distributions of Hadoop to access data in Azure blob stores without moving the data permanently into the cluster.
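All of the back ends above implement Hadoop’s pluggable org.apache.hadoop.fs.FileSystem abstraction, and the URI scheme (hdfs://, ftp://, s3a://, wasb://) selects which implementation is used. Below is a minimal sketch of reading a file through that API; the namenode host, port, and path are placeholders, and remote stores such as S3 or Azure additionally need the matching connector and credentials configured.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintRemoteFile {
    public static void main(String[] args) throws Exception {
        // The URI scheme decides which FileSystem implementation handles the call.
        // "namenode:8020" and the path below are placeholders.
        URI uri = URI.create(args.length > 0
                ? args[0]
                : "hdfs://namenode:8020/data/sample.txt");

        Configuration conf = new Configuration(); // picks up core-site.xml etc. from the classpath

        try (FileSystem fs = FileSystem.get(uri, conf);
             FSDataInputStream in = fs.open(new Path(uri));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // stream the remote file to stdout
            }
        }
    }
}
```

Because the code depends only on the FileSystem interface, the same program can read from HDFS, an FTP server, S3, or Azure blob storage simply by changing the URI.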


Developer(s): Apache Software Foundation
Initial release: December 10, 2011
Stable release: 2.7.2 / January 25, 2016
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Distributed file system
License: Apache License 2.0
Website: hadoop.apache.org

[wikipedia.com]

Applications

Hadoop can be utilized for:

  • Log and/or click stream analysis of various kinds
  • Marketing analytics
  • Machine learning and/or sophisticated data mining
  • Image processing
  • Processing of XML messages
  • Web crawling and/or text processing
  • General archiving, including of relational/tabular data, e.g. for compliance


Cost of Hadoop

Hadoop is open-source, free software. However, there is a non-negligible cost associated with making Hadoop work. I found a cost table for a Hadoop setup on perceivant.com which I liked:

Set-Up Costs                   Cost
Setup ($9,000 x 4 months)      $36,000
Use Case Development           $200,000 – $1M
Hosted Hadoop Set-Up Cost      $236,000 – $1.04M

Talent & Services              Monthly Cost

