Big Data challenges: storage performance

Big Data needs Big Storage and storage is at the end a physical device. Until now, most storage devices are hard disks that require mechanical movements. A common hard drive available today (December 2012) has 15,000 (Seagate, 2013) revolutions per minute (rpm) and a desktop hard drive has some 7,200-rpm. In any case, this means that there is significant latency involved until the reading head is in place. The mechanical approach to storage has been around for decades and scientists as well as engineers complain about storage performance. In-memory was always faster than hard disk storage and the network speed is higher than what can be done with hard disks. (Anthes, 2012) states that disk based storage is about 10-100 times slower than a network and about 1,000 times slower than main memory. This means that there is a significant “bottleneck” when it comes to delivering data from a disk-based storage to an application. As big data is about storing and analyzing data, this is a major challenge to Big Data Applications. It doesn’t help us much if we have enough compute power to analyze data but our disks simply can’t deliver the data in a fast way.

When we look at supercomputers nowadays, they are often measured in cores and Teraflops (Top 500 Supercomputers Site, 2012). This is basically good if you want to do whatever kind of calculation such as the human genome. However, this doesn’t tell us anything about disk performance if we want to store or analyze data. (Zverina, 2011) cites Allan Snavely when he proposes to include the disk performance in such metrics as well:

“I’d like to propose that we routinely compare machines using the metric of data motion capacity, or their ability to move data quickly” – Allan Snavely

Allan Snavely also stated that with increasing data size – hard disks are getting higher in capacity but access time stays the same – it is harder to find data. This can be illustrated easily: you have an external hard disk with the capacity of 1 TB. The hard disk operates with 7,200 rpm and a cache of 16MB. There are 1,000 Videos stored on this hard drive, each with a size of 1 GB. This would fill the entire hard disk. If you now change to a larger system as your videos grow, you would change to a 2 TB system. If this System is full, you won’t be able to transfer the videos to another system in the same time as you did with the 1 TB hard drive. It is very likely that your 2 TB System now needs about twice as much time to transfer the data. Whereas compute performance grows, the performance to access data stays about the same. Given the growth of data and storage capacity, it even gets slower. Allan Snavely (Zverina, 2011) describes this with the following statement:

“The number of cycles for computers to access data is getting longer – in fact disks are getting slower all the time as their capacity goes up but access times stay the same. It now takes twice as long to examine a disk every year, or put another way, this doubling of capacity halves the accessibility to any random data on a given media.”

In the same article, Snavely suggests to include the following metrics in a computer’s performance: DRAM, flash memory, and disk capacity.

But what can enterprises do to achieve higher through output of their systems? There is already some research about that and most resources point towards Solid State Disks as Storage (SSD). Solid State Disks are getting commodity hardware in high end Personal Computers, but they are not that common for servers and distributed systems yet. SSDs normally have better performance but lower disk space and the price per GB is more expensive. If we talk about large-scale databases that have the need for performance, SSDs might be a better choice. The San Diego Supercomputing Center (SDSC) built a supercomputer with SSDs. This computer is called “Gordon” and can handle Data up to 100 times faster as with normal drives (Zverina, 2011). Another prototype, called “Moneta” (Anthes, 2012) used a phase change memory to boost I/O performance. The performance was about 9.5 times faster as a normal RAID-System and about 2.8 times faster as a flash-based raid system.

There is significant research around this topic as the performance of storage is a problem to large-scale data centric systems as we now have with Big Data Applications.


Published by

Mario Meir-Huber

I work as Big Data Architect for Microsoft. With this role, I support my customers in applying Big Data technologies - mainly Hadoop/Spark - for their use-cases. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s