Hadoop Ecosystem: A Beginner’s Overview
“The number all of us have to really pay attention to is, there will be 50 billion connected devices by 2025. That means we will have 50 billion end points that we get to really harness. Then comes the data part: these devices are expected to generate 175 zettabytes of data, a quadrupled growth from the current 45 zettabytes.” - Satya Nadella
Earlier, with limited data, all that was required was a processor and one storage unit. But with millions of devices generating massive amounts of data, it can no longer be stored, processed, and analysed in traditional ways. Moreover, this data poses another set of challenges: the velocity at which it is created, the variety of data created (Excel files, JPEG images, streaming data, etc.), the value of the data created, and so on.
Hadoop as a Solution
Hadoop is an open-source software framework that manages data storage in a distributed way and processes it in parallel on commodity hardware.
- Hadoop HDFS
Hadoop Distributed File System (HDFS) is a distributed file system that allows huge data sets to be stored across cluster nodes or multiple machines. It follows a master-slave topology.
It has two core components. The NameNode (master) manages and maintains the DataNodes and records the metadata. The DataNodes (slaves) store the actual data and serve read and write requests; they also perform block replication.
Each file is stored in HDFS as a set of blocks. Each block has a default size of 128 MB in Apache Hadoop.
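To make the block model concrete, here is a small Python sketch (plain Python, not the Hadoop API; the file size is a made-up example) of how a file's bytes divide into 128 MB blocks:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in Apache Hadoop (128 MB)

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of the given size occupies in HDFS."""
    full_blocks = file_size_bytes // block_size
    sizes = [block_size] * full_blocks
    remainder = file_size_bytes % block_size
    if remainder:
        sizes.append(remainder)  # the final block holds only the leftover bytes
    return sizes

# A hypothetical 300 MB file: two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block occupies only as much space as the remaining data, which is why small files don't waste a full 128 MB.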
- Provides distributed storage
- Runs on commodity hardware
- Provides data security
- Is highly fault tolerant
- Provides streaming access to file system data
- Provides file permissions and authentication
- Offers a command-line interface for interacting with Hadoop
- Hadoop YARN
YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop, responsible for managing cluster resources such as CPU and memory. It acts like an operating system, scheduling jobs and allocating resources.
A number of frameworks can be run on top of YARN like Hadoop MapReduce, Tez, Apache HBase, Storm, Spark etc.
It has two components: the Resource Manager (master) and the Node Manager (slave).
- Hadoop MapReduce
MapReduce is the processing unit of Hadoop: a programming technique in which huge amounts of data are processed and stored in a parallel and distributed fashion.
In the MapReduce approach, processing is done at the slave nodes and the final result is sent to the master node.
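A minimal word-count sketch in plain Python (not actual Hadoop API code, just the idea) showing the three phases a MapReduce job goes through:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, as Hadoop mappers do."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key before they reach the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate (here, sum) the values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs big storage", "big clusters process data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts["big"] is 3 and counts["data"] is 2
```

In a real cluster the map calls run on the slave nodes holding the data blocks, and only the aggregated results travel over the network.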
Hadoop alone cannot provide all the facilities needed to process big data on its own. The Hadoop Ecosystem is a collection of additional software packages that can be installed on top of or alongside Hadoop for various tasks.
There are four steps in big data processing:
Step 1: Data Collection and Ingestion
Step 2: Data Processing
Step 3: Data Analysis
Step 4: Access
Let’s examine how each component helps in the above steps.
Step 1: Data Collection and Ingestion
Data of various types and from various sources, such as relational databases, other systems, or local files, is ingested into Hadoop.
Sqoop is a tool designed to transfer data (import and export) between Hadoop and external data storage systems such as relational database servers (for example, MS SQL Server and MySQL).
Its features include parallel import/export, importing the results of a SQL query, full and incremental loads, and Kerberos security integration.
Sqoop uses the YARN framework to import and export data, which provides fault tolerance on top of parallelism. It also has connectors for all major RDBMSs.
Flume is a distributed service that collects event data and transfers it to HDFS. It is ideally suited for gathering event data from multiple systems.
Apache HBase is a column-oriented database management system, modelled on Google’s NoSQL database Bigtable, that runs on top of HDFS and stores its data there. It is mainly used when you need random, real-time read/write access to your big data. It is horizontally scalable and can therefore support high volumes of data and high throughput.
HBase is a NoSQL database written in Java that performs fast querying. It is also well suited to sparse data sets.
Step 2: Data Processing
Frameworks like Spark and MapReduce perform the data processing.
Spark is an open-source cluster computing framework for storing and processing data in real time across clusters. With its in-memory primitives, it delivers up to 100 times the performance of the two-stage, disk-based MapReduce model for some applications.
Spark can run in the Hadoop cluster and processes data in HDFS.
It also supports a wide variety of workloads, including machine learning, business intelligence, streaming, and batch processing.
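Part of Spark's speed comes from chaining transformations lazily and computing them in a single in-memory pass when an action is called. As a rough conceptual stand-in (plain Python generators, not the actual Spark API), the transformation-then-action pattern looks like this:

```python
numbers = range(1, 11)

# "Transformations" build a lazy pipeline; nothing is computed yet.
squared = (n * n for n in numbers)          # analogous to a map transformation
evens = (n for n in squared if n % 2 == 0)  # analogous to a filter transformation

# The "action" triggers the whole pipeline in one pass, with no
# intermediate results written to disk between stages.
total = sum(evens)
# even squares of 1..10 are 4, 16, 36, 64, 100, so total is 220
```

In MapReduce, by contrast, each stage would write its intermediate output to disk before the next stage could read it.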
Step 3: Data Analysis
After data is processed, it is analysed. This can be done with open-source high-level data flow systems like Pig and Impala.
Pig is mainly used for analytics. It is a scripting platform designed to process and analyze large data sets and it runs on Hadoop clusters.
Pig converts its scripts into map and reduce code, saving the programmer from writing complex MapReduce programs. Ad-hoc operations like joins and filters, which are difficult to perform in raw MapReduce, can be done easily using Pig.
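The join-and-filter that takes only a few lines of a Pig script would otherwise require a hand-written MapReduce job. Sketched conceptually in Python (with hypothetical sample data, purely for illustration), the operation amounts to:

```python
# Two hypothetical relations, as a Pig script might LOAD them:
# users: (user_id, name)   orders: (user_id, amount)
users = [(1, "ana"), (2, "raj"), (3, "li")]
orders = [(1, 250), (2, 90), (1, 40)]

# JOIN users BY user_id, orders BY user_id
joined = [
    (name, amount)
    for (uid, name) in users
    for (order_uid, amount) in orders
    if uid == order_uid
]

# FILTER joined BY amount >= 100
big_orders = [(name, amount) for (name, amount) in joined if amount >= 100]
# big_orders == [("ana", 250)]
```

Pig expresses each of these steps as one declarative statement and compiles the whole dataflow into MapReduce jobs behind the scenes.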
Impala is an open-source SQL engine that runs on Hadoop clusters. It can be used to run a query, evaluate the results immediately, and fine-tune the query.
Impala is ideal for interactive analysis. It can analyze Hadoop data via SQL and other business intelligence tools. Its latency is very low, measured in milliseconds.
Hive is another framework used for data analysis in Hadoop. It is a data warehouse system for querying and analyzing large data sets in HDFS. It executes queries using MapReduce, but users need not write any low-level MapReduce code.
Hive acts as an abstraction layer on top of Hadoop. It is usually preferred for data processing and extract, transform, load (ETL) operations.
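Hive's appeal is that queries stay plain SQL while Hive compiles them into MapReduce jobs over files in HDFS. As a rough illustration only (using Python's built-in sqlite3 as a stand-in for a Hive table, with made-up data):

```python
import sqlite3

# An in-memory SQLite table standing in for a Hive table; in Hive the rows
# would live as files in HDFS, not in a local database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("docs", 45), ("home", 30)],
)

# The kind of aggregation one writes in HiveQL; Hive would translate this
# GROUP BY into a MapReduce job rather than executing it locally.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
# rows == [("docs", 45), ("home", 150)]
```

The analyst writes only the SQL; the map, shuffle, and reduce phases described earlier happen behind the scenes.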
Step 4: Data Exploration / Search
Searching the data can be done with frameworks like Cloudera Search and Hue (Hadoop User Experience).
Cloudera Search requires no technical or programming skills because it provides a simple full-text interface for searching. It uses a flexible, scalable, and robust storage system, which eliminates the need to move large data sets across infrastructures to address business tasks.
Hue is an acronym for Hadoop User Experience. It is an open-source web interface for analyzing data with Hadoop. You can use Hue for the following operations:
- Query a table in Hive and Impala
- Run Spark and Pig jobs and workflows
- Search data
Hadoop jobs such as MapReduce, Pig, Hive, and Sqoop use workflows. Oozie is a workflow coordination system that you can use to manage Hadoop jobs. There are also other components, like Apache Ambari and Apache Mahout.
ZooKeeper is another coordination service for big data systems. Originally developed at Yahoo, it is an open-source coordination service for distributed applications, used for maintaining configuration information, naming, distributed synchronization, and group services. The principal features of ZooKeeper include reliability, scalability, and fast processing.
The above article deals only with the basics of the Hadoop ecosystem. It is a very robust system that has grown over the years, with many components added. Based on the use case, we can choose a set of services from the Hadoop Ecosystem and create a tailored solution for an organization.
Hadoop Map Reduce: https://www.google.com/search?q=hadoop+mapreduce&rlz=1C1GCEU_enIN870IN870&oq=Hadoop+MapReduce&aqs=chrome.0.0l7j69i61.4274j0j9&sourceid=chrome&ie=UTF-8
Apache HBase: https://hbase.apache.org/