Big Data: what or who is the data scientist?

In an earlier post, I outlined that becoming a data scientist requires a lot of knowledge.

To recap, a data scientist needs knowledge in several IT domains:

  • General understanding of distributed systems and how they work. This includes administration skills for Linux as well as hardware-related skills such as networking.
  • Knowledge of Hadoop or similar technologies. This builds on the previous point but is somewhat different and requires a more software-focused background.
  • Strong statistical and mathematical knowledge. This is necessary to work on the actual tasks and to figure out how they can be translated into real algorithms.
  • Presentation skills. All of this is worth nothing if someone can’t present the data or the findings in it. Management might not see the point if the person can’t present data in an appropriate way.

In addition, there are some other skills necessary:

  • Knowledge of the legal situation. The legal basics differ from country to country. Though the European Union sets some legal boundaries for its member states, differences between them remain.
  • Knowledge of the societal impact. It is also necessary to understand how society might react to data analysis. Especially in marketing, it is absolutely necessary to handle this correctly.

Since more and more IT companies are looking for the ideal data scientist, one should first ask who is capable of covering all of these skills. The answer might be: there is no single person who can cover them all. It is likely that someone is great at distributed systems and Hadoop but struggles with transforming questions into algorithms and finally presenting the results.

Data science is more of a team effort than the job of a single person who can handle all of it. It is therefore necessary to build a team that is able to address all of these challenges.

Big Data challenges: moving data for analysis

Another issue with Big Data is pointed out by (Alexander, Hoisie, & Szalay, 2011): data can’t be moved easily for analysis. With Big Data, we often deal with several terabytes or more. Moving this over a network connection is difficult or even impossible. If real-time data is analyzed, it is practically impossible to move that amount of data to another cluster, since the data would be outdated or no longer available by the time it arrives. Real-time data analysis is also necessary in fraud protection. If the data first has to be moved to another cluster, it might already be too late. With traditional databases, this wasn’t that hard, since the data often amounted to a few gigabytes in a single database. With Big Data, data comes in various formats, at high volume and at high velocity. Meeting all of these demands while also moving the data to another cluster might not be possible.
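To get a feeling for the numbers, the following is a rough back-of-the-envelope sketch. The link speed of 1 Gbit/s, the decimal units and the perfect link utilization are assumptions made for illustration; they are not taken from the cited paper.

```python
# Rough estimate of how long it takes to move a data set over a network link.
# Assumptions (illustrative only): decimal units, a 1 Gbit/s link, no protocol overhead.

TERABYTE = 10**12        # bytes
GIGABIT_PER_S = 10**9    # bits per second


def transfer_time_hours(size_bytes, link_bits_per_s, efficiency=1.0):
    """Return the hours needed to push size_bytes through the link.

    efficiency is the usable fraction of the nominal link speed (1.0 = ideal).
    """
    usable_bits_per_s = link_bits_per_s * efficiency
    return size_bytes * 8 / usable_bits_per_s / 3600


for size_tb in (1, 10, 100):
    hours = transfer_time_hours(size_tb * TERABYTE, GIGABIT_PER_S)
    print(f"{size_tb:>3} TB over 1 Gbit/s: about {hours:.1f} hours")
```

Under these idealized assumptions, a single terabyte already takes more than two hours. For data that changes within seconds, the copy would be outdated long before it arrives.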

(Alexander, Hoisie, & Szalay, 2011) describe some factors that influence the challenges of moving data to another cluster: high-flux data, structured and unstructured data, real-time decisions and data organization.

High-flux data is data that arrives in real time. If such data must be analyzed, the analysis also has to happen in real time, because the data might be gone or modified at a later point. In Big Data applications, data arrives structured as well as unstructured. Decisions about the data must often be made in real time. If there is a stream of financial transactions, an algorithm must decide in real time whether a transaction needs more detailed analysis. If not all data is stored, an algorithm must also decide whether a given record is stored at all. Data organization is another challenge when it comes to moving data.
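The idea of deciding per record, while the stream is flowing, can be sketched as follows. The rule used here (flag an amount that exceeds three times the average of the last few amounts) and the names `flag_transactions`, `window` and `factor` are made up for illustration; a real fraud detection system would use far more elaborate models.

```python
from collections import deque


def flag_transactions(amounts, window=5, factor=3.0):
    """Yield (amount, flagged) one record at a time.

    Only the last `window` amounts are kept in memory, so the whole
    stream never has to be stored or moved before a decision is made.
    """
    recent = deque(maxlen=window)
    for amount in amounts:
        baseline = sum(recent) / len(recent) if recent else None
        # Hypothetical rule: an amount far above the recent average is suspicious.
        flagged = baseline is not None and amount > factor * baseline
        yield amount, flagged
        recent.append(amount)


stream = [20.0, 25.0, 22.0, 400.0, 21.0]
for amount, flagged in flag_transactions(stream):
    print(amount, "-> detailed analysis" if flagged else "-> pass")
```

In this sample stream, only the amount of 400.0 is marked for detailed analysis. The decision is made as each record arrives, which is the property the paragraph above asks for.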

Big Data challenges: data partitioning and concurrency

Data needs to be partitioned if it can’t be stored on a single system. With Big Data applications, we are not talking about small storage systems but about distributed systems. Data might be partitioned over hundreds or thousands of nodes, and the database must scale out to meet that demand. Data partitioning is a key concept for databases, and it serves Big Data applications just as well. However, if data is distributed over several servers, it might take a while until all nodes are informed about a change. To avoid concurrency issues, the data must be locked. This can result in poor database performance if the database has to be kept consistent at all times. One solution is to give up strict data consistency in favor of data partitioning. This approach is described in detail in section 1.6.2, where we focus on the CAP theorem.
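One common way to spread records over nodes is hash partitioning. The following is a minimal sketch, assuming string keys and a fixed number of nodes; the function name `node_for_key` and the key format are hypothetical, and real databases use their own partitioning schemes.

```python
import hashlib
from collections import Counter


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a record key to a node index in the range 0..num_nodes-1."""
    # A stable hash is used so that every client computes the same placement.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Distribute 10,000 hypothetical order keys over 8 nodes.
keys = [f"order-{i}" for i in range(10_000)]
placement = Counter(node_for_key(k, 8) for k in keys)
for node in sorted(placement):
    print(f"node {node}: {placement[node]} keys")
```

A simple modulo scheme like this moves most keys to a different node when the number of nodes changes, which is one reason why scaling out a partitioned database is harder than it first appears.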

Let’s imagine a web shop. There are two users in our example, User A and User B, and both want to buy Product P. There is exactly one item in stock. User A sees this and proceeds to checkout, and so does User B. They complete their orders at about the same time. The database in our example is designed so that partitioning is preferred over consistency, and both users get the acknowledgement that their order was processed. Now we would have -1 items in stock, since no database trigger or other mechanism told us that we ran out of items. We either have to tell one user to “forget” the order or find a way to deliver the item to both users. Either way, one user might get angry. Some web shops solve this issue in a non-technical way: they tell the user “sorry, we are unable to deliver in time” and offer the option to cancel the order or take a voucher. However, there is no simple technical solution, and in most cases it will cost the company money.

If the web shop used a system built for consistency instead, it might run into database outages. Users might not buy products because the web site is simply “not available”. The web shop can lose money either through users who were unable to buy products because of delays in the database or through consistency issues. In the case of an outage, users might not return because they are annoyed about the “bad performance of the website” and its “inability to process the order”, whereas people would likely return and buy other products if they got a voucher for issues caused by data partitioning and concurrency.
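The web shop scenario can be simulated in a few lines. This is a toy model, not how any particular database works: the `Replica` class, its methods and the `reconcile` function are invented for illustration, and "replication" is reduced to counting the accepted orders afterwards.

```python
class Replica:
    """A node that answers orders from its own local view of the stock."""

    def __init__(self, name, stock):
        self.name = name
        self.stock = stock   # local copy, possibly out of date
        self.accepted = []   # orders acknowledged by this node

    def place_order(self, user):
        if self.stock <= 0:
            return False     # sold out according to the local view
        self.stock -= 1
        self.accepted.append(user)
        return True


def reconcile(initial_stock, replicas):
    """Compute the real stock once all nodes have exchanged their orders."""
    sold = sum(len(r.accepted) for r in replicas)
    return initial_stock - sold


# Partitioning preferred over consistency: each user hits a different node.
node_1 = Replica("node-1", stock=1)
node_2 = Replica("node-2", stock=1)
ack_a = node_1.place_order("User A")
ack_b = node_2.place_order("User B")
stock_after_sync = reconcile(1, [node_1, node_2])
print(ack_a, ack_b, stock_after_sync)

# Consistency preferred: both orders go through one authoritative node.
primary = Replica("primary", stock=1)
print(primary.place_order("User A"), primary.place_order("User B"))
```

In the first case both users receive an acknowledgement and the reconciled stock is -1. In the second case the second order is rejected, but every order depends on that single node being reachable.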

Big Data challenges: different storage systems

A major factor in Big Data is the variety of data. Data may not only change over time (e.g. a web shop that sells books may later also want to sell cars) but will also come in different formats. Databases must be able to accommodate this. Companies might not store all their data in one single database but rather in several databases, and different APIs consume different formats such as JSON, XML or others. Facebook, for instance, uses MySQL, Cassandra and HBase to store its data. These are three different storage systems (Harris, 2011) (Muthukkaruppan, 2010), each serving a different need.

(Helland, 2011) describes the challenges for datastores with four key principles:

  • unlocked data
  • inconsistent schema
  • extract, transform and load
  • too much to be accurate

Unlocked data means that, while data is usually locked, this might cause problems with Big Data systems, which don’t rely on locked data. On the other hand, unlocked data leads to semantic changes in a database. With inconsistent schema, (Helland, 2011) describes the challenge of data coming from different sources and in different formats. The schema needs to be somewhat flexible to deal with extensibility. As stated earlier, businesses change over time and so does the data schema. Extract, transform and load is particularly relevant to Big Data systems, since data comes from many different sources and needs to be put into place in a specific system. Too much to be accurate outlines the “velocity” problem of Big Data applications. When a result is calculated, it might not be exact, since the data the calculation was built upon might already have changed. (Helland, 2011) states that you might not be accurate at all and can only guess results.
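The interplay of inconsistent schema and extract, transform and load can be sketched with the book and car example from above. The record layouts, the field names (`sku`, `id`, `price`) and the catch-all `extra` field are assumptions made for this sketch; they are not taken from (Helland, 2011).

```python
import json
import xml.etree.ElementTree as ET


def extract_json(text):
    """Extract: parse a JSON document into a flat dict."""
    return json.loads(text)


def extract_xml(text):
    """Extract: parse a flat XML document into a dict of tag -> text."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def transform(record):
    """Transform: map differing source fields onto one flexible target schema."""
    return {
        "sku": str(record.get("sku") or record.get("id")),
        "price": float(record["price"]),
        # Fields the target schema does not know yet are kept, not dropped.
        "extra": {k: v for k, v in record.items() if k not in ("sku", "id", "price")},
    }


warehouse = []  # Load: stands in for the target system

book = extract_json('{"sku": "B-1", "price": 9.5, "pages": 320}')
car = extract_xml("<product><id>C-7</id><price>18000</price><doors>4</doors></product>")
for record in (book, car):
    warehouse.append(transform(record))

for row in warehouse:
    print(row)
```

Keeping unknown fields in `extra` is one simple way to stay flexible when the business, and with it the schema, changes. The price is paid later, when someone has to interpret those fields.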

Big Data challenges: storage performance

Big Data needs big storage, and storage is in the end a physical device. To date, most storage devices are hard disks that require mechanical movement. A high-end server hard drive available today (December 2012) spins at 15,000 revolutions per minute (rpm) (Seagate, 2013), and a desktop hard drive at about 7,200 rpm. In any case, this means that there is significant latency until the read head is in place. The mechanical approach to storage has been around for decades, and scientists as well as engineers complain about storage performance. Main memory has always been faster than hard disk storage, and network speeds are higher than what hard disks can deliver. (Anthes, 2012) states that disk-based storage is about 10 to 100 times slower than a network and about 1,000 times slower than main memory. This creates a significant “bottleneck” when it comes to delivering data from disk-based storage to an application. As Big Data is about storing and analyzing data, this is a major challenge for Big Data applications. It doesn’t help much to have enough compute power to analyze data if the disks simply can’t deliver the data fast enough.
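The latency that comes from rotation alone can be estimated directly from the rpm figures. On average, the disk has to complete half a revolution before the requested sector passes under the head. The sketch below covers only this rotational part; seek time and transfer time come on top and are not modeled here.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in milliseconds: half of one revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


for rpm in (7_200, 15_000):
    print(f"{rpm:>6} rpm: about {avg_rotational_latency_ms(rpm):.2f} ms")
```

This gives about 4.17 ms for a 7,200 rpm drive and 2 ms for a 15,000 rpm drive. Even the faster drive therefore waits milliseconds per random access, and spinning the platters faster only helps linearly.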

When we look at today’s supercomputers, they are usually measured in cores and teraflops (Top 500 Supercomputers Site, 2012). This is fine if you want to run compute-intensive calculations, such as those involving the human genome. However, it doesn’t tell us anything about disk performance when we want to store or analyze data. (Zverina, 2011) cites Allan Snavely, who proposes to include disk performance in such metrics as well:

“I’d like to propose that we routinely compare machines using the metric of data motion capacity, or their ability to move data quickly” – Allan Snavely

Allan Snavely also stated that with increasing data size it gets harder to find data, because hard disks are growing in capacity while access times stay the same. This is easy to illustrate: say you have an external hard disk with a capacity of 1 TB. It operates at 7,200 rpm and has a cache of 16 MB. There are 1,000 videos stored on this drive, each 1 GB in size, which fills the entire disk. As your video collection grows, you switch to a 2 TB drive. Once this drive is full, you won’t be able to transfer the videos to another system in the same time as with the 1 TB drive. It is very likely that the 2 TB drive needs about twice as long to transfer the data. While compute performance grows, the performance of data access stays about the same. Relative to the growth of data and storage capacity, it even gets slower. Allan Snavely (Zverina, 2011) describes this with the following statement:

“The number of cycles for computers to access data is getting longer – in fact disks are getting slower all the time as their capacity goes up but access times stay the same. It now takes twice as long to examine a disk every year, or put another way, this doubling of capacity halves the accessibility to any random data on a given media.”
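The hard disk example can be put into numbers with a small sketch. The sustained throughput of 100 MB/s is an assumed round figure for illustration, not a measured value, and it is held constant across drive sizes to mirror the argument that capacity grows while access speed does not.

```python
GIGABYTE = 10**9
TERABYTE = 10**12
ASSUMED_THROUGHPUT = 100 * 10**6   # bytes per second, hypothetical and constant


def full_read_hours(capacity_bytes, throughput_bytes_per_s=ASSUMED_THROUGHPUT):
    """Hours needed to read a completely full drive once, start to end."""
    return capacity_bytes / throughput_bytes_per_s / 3600


for capacity_tb in (1, 2, 4):
    videos = capacity_tb * TERABYTE // GIGABYTE   # number of 1 GB videos that fit
    hours = full_read_hours(capacity_tb * TERABYTE)
    print(f"{capacity_tb} TB ({videos} videos): about {hours:.1f} hours")
```

With constant throughput, every doubling of capacity doubles the time needed to read the whole drive, which matches the statement quoted above.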

In the same article, Snavely suggests including the following metrics in the assessment of a computer’s performance: DRAM, flash memory, and disk capacity.

But what can enterprises do to achieve higher throughput in their systems? There is already some research on this, and most sources point towards Solid State Disks (SSDs) as storage. SSDs are becoming commodity hardware in high-end personal computers, but they are not yet that common in servers and distributed systems. SSDs normally offer better performance but less capacity, and their price per GB is higher. For large-scale databases that need performance, SSDs might be the better choice. The San Diego Supercomputer Center (SDSC) built a supercomputer with SSDs. This computer is called “Gordon” and can handle data up to 100 times faster than with normal drives (Zverina, 2011). Another prototype, called “Moneta” (Anthes, 2012), uses phase-change memory to boost I/O performance. Its performance was about 9.5 times faster than a normal RAID system and about 2.8 times faster than a flash-based RAID system.

There is significant research around this topic, since storage performance is a problem for large-scale, data-centric systems such as today’s Big Data applications.