A main factor in Big Data is the variety of data. Data not only changes over time (e.g. a web shop that initially sells only books later also wants to sell cars) but also comes in different formats, and databases must accommodate this variety. Companies might not store all their data in one single database but rather in several, and different APIs consume different formats such as JSON, XML or other types. Facebook, for instance, uses MySQL, Cassandra and HBase to store its data: three different storage systems (Harris, 2011) (Muthukkaruppan, 2010), each serving a different need.
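As a small illustration of consuming the same entity in different formats, the following sketch normalizes a hypothetical product record delivered once as JSON and once as XML into one internal representation (the record and field names are invented for illustration):

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical product record, delivered by two different APIs.
json_payload = '{"id": 1, "name": "Book", "price": 9.99}'
xml_payload = '<product><id>1</id><name>Book</name><price>9.99</price></product>'

def from_json(payload):
    # JSON maps directly onto a dictionary.
    return json.loads(payload)

def from_xml(payload):
    # Flatten a one-level XML document into the same dictionary shape.
    root = ET.fromstring(payload)
    record = {child.tag: child.text for child in root}
    record["id"] = int(record["id"])
    record["price"] = float(record["price"])
    return record

# Both sources normalize to one internal representation.
assert from_json(json_payload) == from_xml(xml_payload)
```

In practice each consuming API would ship its own adapter of this kind, so the storage layer sees a single schema regardless of the wire format.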
(Helland, 2011) describes the challenges for datastores in terms of four key principles:
- unlocked data
- inconsistent schema
- extract, transform and load
- too much to be accurate
By unlocked data, (Helland, 2011) means that data is traditionally protected by locks, but Big Data systems cannot rely on locking; giving up locks, however, leads to semantic changes in a database. With inconsistent schema, (Helland, 2011) describes the challenge of data arriving from different sources in different formats. The schema needs to be flexible enough to deal with extensibility: as stated earlier, businesses change over time and so does their data schema. Extract, transform and load is specific to Big Data systems, since data comes from many different sources and must be brought into a form suitable for a specific target system. Too much to be accurate outlines the "velocity" problem of Big Data applications: if a result is calculated, it may not be exact, since the data the calculation was built upon might already have changed. (Helland, 2011) states that results may not be accurate at all and can only be estimated.
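The extract, transform and load step can be sketched as follows; the source records, field names and target store are invented for illustration, not taken from any particular system:

```python
# Minimal ETL sketch: extract records from heterogeneous sources,
# transform them to one target schema, and load them into a store.

def extract():
    # Records from two hypothetical sources with different schemas.
    yield {"source": "shop_a", "title": "DB Book", "price_eur": 20}
    yield {"source": "shop_b", "name": "DB Book", "cost": "20.00"}

def transform(record):
    # Map source-specific fields onto one common target schema.
    name = record.get("title") or record.get("name")
    price = float(record.get("price_eur", record.get("cost", 0)))
    return {"name": name, "price": price}

def load(records, store):
    # Here the "warehouse" is just a list standing in for a real store.
    store.extend(records)

warehouse = []
load((transform(r) for r in extract()), warehouse)
```

After the run, every record in the warehouse follows one consistent schema, regardless of which source schema it originated from.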