Hadoop Tutorial – The Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is one of the key services of Hadoop. HDFS is a distributed file system that abstracts away the local file systems of the individual nodes. With HDFS, you get a virtual file system that spans several nodes and allows you to store large amounts of data. HDFS can also run in a non-distributed way as a standalone system, but its purpose is to serve as a distributed file system.

One of the nice things about HDFS is that it runs on almost any hardware – which gives us the possibility to integrate existing systems into Hadoop. HDFS is also fault tolerant, reliable, scalable and easy to extend – just like any other Hadoop project!

HDFS works on the assumption that failures do happen and is built to tolerate them. It is designed to restart after a failure, and recovery is easy as well.

HDFS is also built for streaming data access. Batch processes read files sequentially at high throughput, which suits Big Data analytics better than low-latency random access to individual records.

HDFS is built for large amounts of data: you would usually store terabytes of data in it. The HDFS model follows a “write once, read many” approach, which means that reading data is fast and easy, but writing data might not be as performant. As a consequence, you wouldn’t build an application on top of HDFS that serves purposes other than analytics. That’s not what HDFS is designed for.

With HDFS, you basically don’t move data around. Once the data is in HDFS, it will likely stay there since it is “big”. Moving this data to another place might not be effective.

HDFS architecture

The figure above shows the HDFS architecture. HDFS has NameNodes, which take care of metadata handling, the distribution of files and the like. The client talks to HDFS itself to write and read files, without knowing on which (physical) node a file resides.
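To make the role of the NameNode a bit more concrete, here is a minimal toy sketch in Python. It is not how HDFS is implemented. The class `ToyNameNode`, its methods and the round-robin placement are made up for illustration (real HDFS placement is rack-aware); only the 128 MB block size and the replication factor of 3 reflect common HDFS defaults. The nodes that hold the actual blocks are called DataNodes in HDFS.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # default HDFS replication factor


@dataclass
class ToyNameNode:
    """Hypothetical stand-in: stores metadata only, never file contents."""
    datanodes: list
    namespace: dict = field(default_factory=dict)

    def add_file(self, path, size_bytes):
        # Number of blocks, rounded up; an empty file has no blocks.
        n_blocks = -(-size_bytes // BLOCK_SIZE)
        copies = min(REPLICATION, len(self.datanodes))
        blocks = []
        for i in range(n_blocks):
            # Simplified round-robin placement (an assumption of this sketch).
            replicas = [self.datanodes[(i + r) % len(self.datanodes)]
                        for r in range(copies)]
            blocks.append((f"{path}#blk_{i}", replicas))
        self.namespace[path] = blocks
        return blocks

    def locate(self, path):
        # A client asks here first, then reads the blocks from the DataNodes.
        return self.namespace[path]


if __name__ == "__main__":
    nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    nn.add_file("/logs/app.log", 300 * 1024 * 1024)
    for block_id, replicas in nn.locate("/logs/app.log"):
        print(block_id, replicas)
```

The point of the sketch is the separation of concerns: the client only names a path, and the lookup of blocks and nodes happens behind the scenes.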

There are several possibilities to access HDFS:

  • REST: HDFS exposes a REST API called WebHDFS. Since it only needs HTTP, this API can be used from any language, including Java.
  • libhdfs: This is the C library you use when accessing HDFS from C or C++.
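As a small illustration of the REST option, the following Python sketch builds WebHDFS URLs. WebHDFS requests follow the pattern `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`, with operations such as `OPEN`, `LISTSTATUS` or `GETFILESTATUS`. The helper `webhdfs_url`, the host name and the paths are made up for this example, and the default port 9870 assumes a Hadoop 3 NameNode with default settings (Hadoop 2 used 50070).

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=9870, **params):
    """Build a WebHDFS URL (hypothetical helper, not part of Hadoop)."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # 'op' selects the operation; extra parameters such as user.name follow.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Host and paths are placeholders; adjust them to your cluster.
    print(webhdfs_url("namenode.example.com", "/user/alice/data.csv", "OPEN"))
    print(webhdfs_url("namenode.example.com", "/user/alice", "LISTSTATUS",
                      **{"user.name": "alice"}))
```

A URL built this way can be fetched with any HTTP client, for example `urllib.request.urlopen`. For `OPEN`, the NameNode typically answers with a redirect to a node that holds the data, so the client needs to follow redirects.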

Software defined Storage (SdS) in the Cloud

Cloud Computing has changed how we handle IT in several ways. Common tasks that used to take a lot of time have been largely automated, and much more is still to come. Another interesting development is “Software defined X”. This basically means that infrastructure elements become more automated as well, which makes them more scalable and easier for applications to use. A term used frequently of late is “Software defined Networking”. However, there is another one that sounds promising, especially for Cloud Computing and Big Data: Software defined Storage.

Software defined Storage promises to abstract the way we use storage. This is especially useful for large scale systems, as no one really wants to care about how content is distributed across different servers. This should basically be opaque to end-users (software developers). For instance, if you are using a storage system for your website, you want to have an API like Amazon’s S3. There is no need to worry about which physical machine your files are stored on; you just specify the desired region. The back-end system (in this case, Amazon S3) takes care of that.

Software defined Storage explained

In terms of architecture, you simply communicate with the abstraction layer, which takes care of distribution, redundancy and other factors.
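A minimal sketch of such an abstraction layer could look like the following Python code. Everything in it is an assumption made for illustration: the class `ToyStorageLayer`, the in-memory dictionaries that stand in for servers, and the hash-based placement with two copies. Real systems such as Amazon S3, HDFS or GlusterFS are far more sophisticated.

```python
import hashlib


class ToyStorageLayer:
    """Hypothetical facade: callers use put/get and never pick a server."""

    def __init__(self, backends, copies=2):
        self.backends = backends  # server name -> dict acting as its disk
        self.copies = copies

    def _targets(self, key):
        # Distribution: hash the key to choose where the copies go.
        names = sorted(self.backends)
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(names)
        count = min(self.copies, len(names))
        return [names[(start + i) % len(names)] for i in range(count)]

    def put(self, key, data):
        # Redundancy: write the same data to every target server.
        for name in self._targets(key):
            self.backends[name][key] = data

    def get(self, key):
        # Return the first copy that is still available.
        for name in self._targets(key):
            if key in self.backends[name]:
                return self.backends[name][key]
        raise KeyError(key)


if __name__ == "__main__":
    layer = ToyStorageLayer({"server-a": {}, "server-b": {}, "server-c": {}})
    layer.put("images/logo.png", b"binary data")
    print(layer.get("images/logo.png"))
```

The caller only ever sees `put` and `get`. Which server holds the data, and how many copies exist, stays inside the layer and could change without touching application code.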

At present, there are several systems available that take care of that: next to well-known systems such as Amazon S3, there are also other solutions such as the Hadoop Distributed File System (HDFS) or GlusterFS.


Header Image Copyright: nyuhuhuu. Licensed under the Creative Commons 2.0.