Hadoop Tutorial – The Hadoop Distributed File System (HDFS)


The Hadoop Distributed File System (HDFS) is one of the key services for Hadoop. HDFS is a distributed file system that abstracts each individual hard disk file system form a specific node. With HDFS, you get a virtual file system that spans over several nodes and allows you to store large amounts of data. HDFS can also operate in a non-distributed way as a standalone system but the purpose of it is to serve as a distributed file system.

One of the nice things about HDFS is that it runs on almost any hardware – which gives us the possibility to integrate existing systems into Hadoop. HDFS is also fault tolerant, reliable, scalable and easy to extend – just like any other Hadoop project!

HDFS works with the assumption that failures do happen – and is built to work fault-tolerant. HDFS is built to reboot in case of failures. Recovery is also easy with HDFS.

As streaming is a major trend in Big Data analytics, HDFS is built to serve that. HDFS allows to access streaming data via batch-processes.

HDFS is built for large amounts of data – you would usually store some terabytes of data in HDFS. The model of HDFS is built for a “write once, read many” approach, which means that it is fast and easy to read data, but writing data might not be as performant. This means that you wouldn’t use Hadoop to build an application on top of it that serves other purposes than providing analytics. That’s not the target for HDFS.

With HDFS, you basically don’t move data around. Once the data is in HDFS, it will likely stay there since it is “big”. Moving this data to another place might not be effective.

HDFS architecture

The above figure shows the HDFS architecture. HDFS has NamedNodes, which take care of the Metadata handling, distribution of files and alike. The client talks to HDFS itself to write and read files, without knowing on which (physical) node the file resides.

There are several possibilities to access HDFS:

  • REST: HDFS exposes a Rest-API which is called WebHDFS. This REST-API is also used from Java.
  • Libhdfs: This is what you use when accessing HDFS from C or C++.
Advertisements

Published by

Mario Meir-Huber

I work as Big Data Architect for Microsoft. With this role, I support my customers in applying Big Data technologies - mainly Hadoop/Spark - for their use-cases. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s