Hadoop Tutorial – The Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is one of the key services for Hadoop. HDFS is a distributed file system that abstracts each individual hard disk file system form a specific node. With HDFS, you get a virtual file system that spans over several nodes and allows you to store large amounts of data. HDFS can also operate in a non-distributed way as a standalone system but the purpose of it is to serve as a distributed file system.

One of the nice things about HDFS is that it runs on almost any hardware – which gives us the possibility to integrate existing systems into Hadoop. HDFS is also fault tolerant, reliable, scalable and easy to extend – just like any other Hadoop project!

HDFS works with the assumption that failures do happen – and is built to work fault-tolerant. HDFS is built to reboot in case of failures. Recovery is also easy with HDFS.

As streaming is a major trend in Big Data analytics, HDFS is built to serve that. HDFS allows to access streaming data via batch-processes.

HDFS is built for large amounts of data – you would usually store some terabytes of data in HDFS. The model of HDFS is built for a “write once, read many” approach, which means that it is fast and easy to read data, but writing data might not be as performant. This means that you wouldn’t use Hadoop to build an application on top of it that serves other purposes than providing analytics. That’s not the target for HDFS.

With HDFS, you basically don’t move data around. Once the data is in HDFS, it will likely stay there since it is “big”. Moving this data to another place might not be effective.

HDFS architecture

The above figure shows the HDFS architecture. HDFS has NamedNodes, which take care of the Metadata handling, distribution of files and alike. The client talks to HDFS itself to write and read files, without knowing on which (physical) node the file resides.

There are several possibilities to access HDFS:

  • REST: HDFS exposes a Rest-API which is called WebHDFS. This REST-API is also used from Java.
  • Libhdfs: This is what you use when accessing HDFS from C or C++.

Hadoop Tutorial – Scheduling MapReduce Workflows with Oozie

Apache Oozie is the workflow scheduler for Hadoop Jobs. Oozie basically takes care of the step-wise workflow iteration in Hadoop. Oozie is like all other Hadoop projects built for high scalability, fault tolerance and extensible.

An Oozie Workflow is started by data availability or after a specific time. Oozie is the root for all MapReduce jobs as they get scheduled via Oozie. This also means that all other projects such as Pig and Hive (which we will discuss later on) also take advantage of Oozie.

Oozie workflows are described in an XML-Dialect, which is called hPDL. Oozie knows two different types of nodes:

  • Control-Flow-Nodes that take do exactly what the name says: controlling the flow.
  • Action-Nodes take care of the actual execution of a job.

The following illustration shows the iteration process in an Oozie Workflow. The first step for Oozie is to start a task (MapReduce Job) on a remote system. Once the task has completed, the remote system sends the result back to the remote system via a callback function.

Apache Oozie
Apache Oozie

Hadoop Tutorial – Apache ZooKeeper for distributed coordination

One of the key infrastructure services for Hadoop is Apache ZooKeeper. ZooKeeper is in charge of coordinating nodes in the Hadoop cluster. Key challenges for ZooKeeper in that domain are to provide high availability for Hadoop and to take care of the distributed coordination.

Under these challenges, Hadoop takes care of managing the cluster configuration for Hadoop. A key challenge in the Hadoop Cluster is naming, which has to be applied to all nodes within a cluster. Apache ZooKeeper takes care of that by providing unique names to individual nodes based on naming conventions.

The hierarchy in Zookeeper
The hierarchy in Zookeeper

As shown in Figure 7, naming is hierarchical. This means that naming also occurs via a path. The Root instance starts with a “/”, all successors have their unique name, and their successors also apply this naming schema. This enables the cluster to have nodes with child-nodes, which in return has positive effects on maintainability.

ZooKeeper takes care of synchronization within the distributed environment and provides some group services to the Hadoop Cluster. As of synchronization, there is one server in the ZooKeeper Service that acts as the “Leader” of all servers running under the ZooKeeper Service. The following illustration shows this.

Synchronisation in the ZooKeeper Service
Synchronisation in the ZooKeeper Service

To ensure a high uptime and availability, individual servers in the ZooKeeper service are mirrored. Each of the servers in the service knows any other server. In case that one server has a failure and isn’t available any more, clients connect to other servers. The ZooKeeper service itself is built for failover and is also highly scalable.

Hadoop Tutorial – Apache Ambari for Cluster Management

Apache Ambari was developed by the Hadoop distributor Hortonworks and also comes with their distribution. The aim of Ambari is to make the management of Hadoop clusters easier. Ambari is useful, if you run large server farms based on Hadoop. Ambari automates much of the manual work you would need to do with Hadoop when managing your cluster from the console.

Ambari comes with three key aspects around cluster management: first, it is about provisioning instances. This is helpful when you want to add new instances to your Hadoop cluster. Ambari takes care of automating all aspects of adding new instances. Next, there is monitoring. Ambari monitors your server farm and gives you an overview on what is going on. The last aspect is the management of your server farm itself.

Provisioning has always been a very tricky part of Hadoop. When someone wanted to add new nodes to a cluster, this was basically not an easy thing to do and included a lot of manual work. Most organizations abstracted this problem by creating scripts and using automation software, but this simply couldn’t fill the scope that is often necessary in Hadoop clusters. Ambari provides an easy-to-use assistant that enables users to install new services or activate/deactivate them. Ambari takes care of the entire cluster provisioning and configuration with an easy UI.

Ambari also includes comprehensive monitoring capabilities for the cluster. This allows user to view the status of the cluster in a dashboard and to get to know immediately what the cluster is up to (or not). Ambari uses Apache Ganglia to collect the metrics. Ambari also integrates the possibility to send System messages via Apache Nagios. This includes alerts and other things that are necessary for the administrator of the cluster.

Other key aspects of Ambari are:

  • Extensibility. Ambari is built on a plug-in architecture, which basically allows you to extend Ambari with your own functionality used within your company or organization. This is useful if you want to integrate Hadoop into your business processes.
  • Fault Tolerance. Ambari takes care of errors and reacts to them. For example, if an instance has an error, Ambari restarts this instance. This takes away much of the headache you got in previous, pre-Ambari, versions of Hadoop.
  • Secure. Ambari uses a role-based authentication. This gives you more control over sensitive information in your cluster(s) and enables you to apply different roles.
  • Feedback. Ambari provides Feedback to the user(s) about long-running processes. This is especially useful for stream processing and near-real-time processes that basically have no end of their lifespan.

Apache Ambari can be accessed easily via two different ways: first, Ambari provides a mature UI that enables you to access the cluster management via a Browser. Furthermore, Ambari can also be accessed via ReSTful Web Services, which gives you additional possibilities in working with the service.

The following illustration outlines the Ambari Server and the Agents Communication.

Apache Ambari
Apache Ambari

As of the architecture, Ambari leverages several projects. As key elements, Ambari uses message queues for communication. The configuration within Apache Ambari is done by Puppet. The next figure shows the overall architecture of Ambari.


Apache Ambari Architecture
Apache Ambari Architecture