Big Data 101: Partitioning

Partitioning is another factor for Big Data Applications. It is one of the factors of the CAP-Theorem (see 1.6.1) and is also important for scaling applications. Partitioning basically describes the ability to distribute a database over different servers. In Big Data Applications, it is often not possible to store everything on one (Josuttis, 2011)

Data Partitioning
Data Partitioning

The factors for partitioning illustrated in the Figure: Partitioning are described by (Rys, 2011). Functional partitioning is basically describing the service oriented architecture (SOA) approach (Josuttis, 2011). With SOA, different functions are provided by their own services. If we talk about a Web shop such as Amazon, there are a lot of different services involved. Some Services handle the Order Workflow; other Services handle the search and so on. If there is high load on a specific service such as the shopping cart, new instances can be added on demand. This reduces the risk of an outage that would lead to loosing money. Building a service-oriented architecture simply doesn’t solve all problems for partitioning. Therefore, data also has to be partitioned. By data partitioning, all data is distributed over different servers. They can also be distributed geographically. A partition key basically identifies partitioned Data. Since there is a lot of data available and single nodes may fail, it is necessary to partition data in the network. This means that data should be replicated and stored redundant in order to deal with node failures.

Big Data 101: Scalability

Scalability is another factor of Big Data Applications described by (Rys, 2011). Whenever we talk about Big Data, it mainly involves high-scaling systems. Each Big Data Application should be built in a way that eases scaling. (Rys, 2011) describes several needs for scaling: user load scalability, data load scalability, computational scalability and scale agility.

Data Scalability
Data Scalability

The figure illustrates the different needs for scalability in Big Data environments as described by (Rys, 2011). Many applications such as Facebook (Fowler, 2012) have a lot of users. Applications should support the large user base and should stay prone to errors in case the application sees unexpected high user numbers. Various techniques can be applied to support different needs such as fast data access. A factor that often – but not only – comes with a high number of users is the data load. (Rys, 2011) describes that some or many users can produce this data. However, things such as sensors and other devices that do not directly relate to users can also produce large datasets. Computational scalability is the ability to scale to large datasets. Data is often analyzed and this needs compute power on the analysis side. Distributed algorithms such as Map/Reduce require a lot of nodes in order to perform queries and analyze in a performing manner. Scale agility describes the possibility to change the environment of a system. This basically means that new instances such as compute can be added or removed on-demand. This requires a high level of automation and virtualization and is very similar to what can be done in cloud computing environments. Several Platforms such as Amazon EC2, Windows Azure, OpenStack, Eucalyptus and others enable this level of self-service that is a great support to scaling agility for Big Data environments.

Big Data 101: Data agility

Agility is an important factor to Big Data Applications. (Rys, 2011) describes 3 different agility factors which are: model agility, operational agility and programming ability.

Data agility
Data agility

Model agility means how easy it is to change the Data Model. Traditionally, in SQL Systems it is rather hard to change a schema. Other Systems such as non-relational Databases allow easy change to the Database. If we look at Key/Value Storages such as DynamoDB (Amazon Web Services, 2013), the change to a Model is very easy. Databases in fast changing systems such as Social Media Applications, Online Shops and other require model agility. Updates to such systems occur frequently, often weekly to daily (Paul, 2012).

In distributed environments, it is often necessary to change operational aspects of a System. New Servers get added often, also with different aspects such as Operating System and Hardware. Database systems should stay tolerant to operational changes, as this is a crucial factor to growth.

Database Systems should support the software developers. This is when programming agility comes into play. Programming agility describes the approach that the Database and all associated SDK’s should easy the live of a developer that is working with the Database itself. Furthermore, it should also support fast development.

Big Data 101: Transformable and Filterable Data

Transformable If data is transformed, it can be changed to a different format or layout. This could as well mean the format change from binary to e.g. Json or XML as well as a totally new representation. If someone wants to look at a specific dataset (which, for instance, could be filtered) not all data might be interesting. Let’s assume that a manager wants to filter for all Customers younger than 18 in a specific district. The manager is probably not interested in the names of the customer but rather in the sum of customers. Instead returning a huge list of Names with addresses and alike, a number is returned. Or the online marketing department wants to target all customers with specific criteria such as age, the address might not be relevant, but Names and E-Mail are. Transformability is also a necessary characteristic if data has to be exported to another database, e.g. for analytics.

Filterable is a key characteristic to Datasets. Analytics software use Filtering frequently and it is absolutely necessary since most analytics simply don’t run on all data but rather on selected Data. Filtered Data is often represented with the “Select … Where”-Clauses in Databases. Most of what filtering of data is good for was already discussed with “Transformability”, however we would still go into detail with that. If we analyze data, it is often necessary to work on specific datasets. Imagine a Google Search Query, where you search for “Big Data”. All Data within Google’s index gets filtered for exactly these Words and a consolidated List is returned. If the online marketing department mentioned in “Transformability” wants a list of customers in a specific area, this List is also filtered based on the Zip Code or other geographical data. Hence it is an important characteristic for Data to support Filtering.