Hadoop Tutorial – an overview of Hadoop projects

Last week I wrote a blog post introducing the Hadoop project and gave an overview of the Map/Reduce algorithm. This week, I will outline the Hadoop stack and its major technologies. Please note: there are many projects in the Hadoop stack, so this list is not complete. The following figure outlines the major Hadoop projects.

The Hadoop technology stack

I have grouped the Hadoop stack into several layers. The lowest layer is cluster management, which covers everything needed to manage and run Hadoop. Projects on this layer include Ambari for provisioning, monitoring and management, ZooKeeper for coordination and reliability, and Oozie for workflow scheduling. This layer is focused on infrastructure; if you work on this layer, you normally don’t analyse data (yet).

Moving one level up, we find ourselves in the “Infrastructure” layer. This layer is not about physical or virtual machines or disk storage. I called it “Infrastructure” since it contains projects that are used by other Hadoop components. This includes Hadoop Common, a set of shared libraries, and HDFS (the Hadoop Distributed File System). HDFS is used by all other projects. It is a virtual file system that can span many different servers and abstracts the individual (machine-based) file systems into one common file system.
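To make the idea of "one common file system on top of many servers" more concrete, here is a minimal sketch in plain Python. It is a toy model, not HDFS and not its API: the class `ToyDistributedFS`, its fields, the tiny block size and the round-robin placement are all assumptions made up for illustration. Real HDFS additionally replicates blocks and keeps the file-to-block mapping on a dedicated NameNode.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDistributedFS:
    """Toy model (hypothetical): one file namespace on top of several 'servers'."""
    num_servers: int
    block_size: int = 4  # bytes; tiny on purpose so the example stays readable
    # servers[i] maps a block id to the bytes stored on server i
    servers: list = field(default_factory=list)
    # index maps a file path to an ordered list of (server, block id) pairs
    index: dict = field(default_factory=dict)

    def __post_init__(self):
        self.servers = [{} for _ in range(self.num_servers)]

    def write(self, path: str, data: bytes) -> None:
        """Split data into blocks and spread them round-robin over the servers."""
        locations = []
        for n, start in enumerate(range(0, len(data), self.block_size)):
            server = n % self.num_servers  # assumed placement policy
            block_id = f"{path}#{n}"
            self.servers[server][block_id] = data[start:start + self.block_size]
            locations.append((server, block_id))
        self.index[path] = locations

    def read(self, path: str) -> bytes:
        """Reassemble the file; the caller never sees which server holds what."""
        return b"".join(self.servers[s][b] for s, b in self.index[path])


fs = ToyDistributedFS(num_servers=3)
fs.write("/logs/day1.txt", b"hello hadoop")
print(fs.read("/logs/day1.txt"))
```

The caller only works with a path such as `/logs/day1.txt`; where the individual blocks live is hidden behind `write` and `read`, which is the abstraction described above.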

The next layer could also be called the 42 layer, since Apache YARN is the core of almost everything you do in Hadoop. YARN takes care of Map/Reduce jobs and many other things, including resource management, job management, job tracking and job scheduling.
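As a reminder of what a Map/Reduce job actually computes, here is a minimal word-count sketch in plain Python. It runs in a single process and does not use any Hadoop or YARN API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration. On a real cluster, the map and reduce steps would run as distributed tasks.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after another in a single process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Each phase only needs its own slice of the data, which is why such a job can be spread over many machines and why a scheduler is needed to place and track the individual tasks.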

The next layer is all about data. As we can see here, this layer contains a lot of projects for the three core areas when it comes to data: data storage, data access and data science. For data storage, a key project is HBase, a distributed key/value database built for large amounts of data. We will dig deeper into HBase in a couple of weeks. Data access includes several important projects such as Hive (a SQL-like query language), Pig (a data flow language), Spark and Storm (streaming and in-memory processing for real-time applications), and Giraph (graph processing). Mahout is the only project in the data science area. Mahout is useful for machine learning, clustering and recommendation mining.

On the next layer, we have several tools for data governance and integration. The projects on this layer come into play when data needs to be imported into Hadoop.

The last layer consists of Apache Hue. This is the Hadoop UI that makes our lives easier 😉

Next week, I will give more insight into the individual layers discussed here. Stay tuned 😉



Published by

Mario Meir-Huber

I work as Big Data Architect for Microsoft. With this role, I support my customers in applying Big Data technologies - mainly Hadoop/Spark - for their use-cases. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data.
