Why building Hadoop on your own doesn’t make sense

There are several things people discuss when it comes to Hadoop and there are some wrong discussions. First, there is a small number of people believing that Hadoop is a hype that will end at some point in time. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. But there are also some people that basically coin two major sayings: the first group of people states that Hadoop is cheap because it is open source and the second group of people states that Hadoop is expensive because it is very complicated. (Info: by Hadoop, I also include Spark and alike)

Neither the one nor the other is true.

First, you can download it for free and install it on your system. This makes it basically free in terms of licenses, but not in terms of running it. When you get a vanilla Hadoop, you will have to think about hotfixes, updates, services, integration and many more tasks that will get very complicated. This ends up in spending many dollars on Hadoop experts to solve your problems. Remember: you didn’t solve any business problem/question so far, as you are busy running the system! You spend dollars and dollars on expensive operational topics instead of spending them on creating value for your business.

Now, we have the opposite. Hadoop is expensive. Is it? In the past years I saw a lot of Hadoop projects the went more or less bad. Costs were always higher than expected and the project timeframe was never kept. Hadoop experts have a high income as well, which makes consulting hours even more expensive. Plus: you probably won’t find them on the market, as they can select what projects to make. So you have two major problems: high implementation cost and low ressource availability.

Another factor that is relevant to the cost discussion is the cluster utilization. In many projects I could see one trend: when the discussion about cluster sizing is on, there are two main decisions: (a) sizing the cluster to the highest expected utilization or (b) making the cluster smaller than the highest expected utilization. If you select (a), you have another problem: the cluster might be under-utilized. What I could see and what my clients often have, is the following: 20% of the time, they have full utilization on the cluster, but 80% of the time the cluster utilization is below 20%. This basically means that your cluster is very expensive when it comes to business case calculation. If you select (b), you will loose business agility and your projects/analytics might require long compute times.

At the beginning of this article, I promised to explain that Hadoop is still cost-effective. So far, I only stated that it might be expensive, but this would mean that it isn’t cost effective. Hadoop is still cost effective but I will give you a solution in my next blog post on that, so stay tuned ūüėČ

RACEing to agile Big Data Analytics

I am happy to announce the development we did over the last month within Teradata. We developed a light-weight process model for Big Data Analytic projects, which is called “RACE”. The model is agile and resembles the know-how of more than 25 consultants that worked in over 50 Big Data Analytic projects in the recent month. Teradata also developed CRISP-DM, the industry leading process for data mining. Now we invented a new process for agile projects that addresses the new challenges of Big Data Analytics.

Where does the ROI comes from?

This was one of the key questions we addressed when developing RACE. The economics of Big Data Discovery Analytics are different to traditional Integrated Data Warehousing economics. ROI comes from discovering insights in highly iterative projects run over very short time periods (4 to 8 weeks usually) Each meaningful insight or successful use case that can be actioned generates ROI. The total ROI is a sum of all the successful use cases. Competitive Advantage is therefore driven by the capability to produce both a high volume of insights as well as creative insights that generate a high ROI.

What is the purpose of RACE?

RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge & creativity to produce high ROI insights

How does the process look like?

RACE - an agile process for Big Data Analytic Projects
RACE – an agile process for Big Data Analytic Projects

The process itself is divided into several short phases:

  • Roadmap.That’s an optional first step (but heavily recommended) to built a roadmap on where the customer wants to go in terms of Big Data.
  • Align.¬†Use-cases are detailed and data is confirmed.
  • Create.¬†Data is loaded, prepared and analyzed. Models are developed
  • Evaluate. Recommendations for the business are given

In the next couple of weeks we will publish much more on RACE, so stay tuned!

What is necessary to achieve interoperability in the Cloud?

What is necessary to achieve interoperability in the Cloud?

As described in the previous sections, 3 major interoperability approaches arise. First, there is the standardisation approach, next there is the middleware approach and last but not least there is the API approach. This is also supported by [Hof09] and [Gov10]. In addition to that, [Gov10] suggests building abstraction layers in order to achieve interoperability and transportability.

There are two main aspects where interoperability is necessary. One level is the management level. This deals with handling the virtual machine(s), applying load balancing, setting DNS settings, auto scaling features and other tasks that come with IaaS solutions. However, this level is mainly necessary in IaaS solutions as PaaS solutions already take care of most of it. The other level is the services level. The services level is basically everything that comes with application services such as messaging, data storage and databases.

interop approaches

Figure: Cloud interoperability approaches

These requirements are described in several relevant papers such as [End10], [Mel09], [Jha09].

Parameswaran et al. describes similar challenges for Cloud interoperability. The authors see two different approaches, the first via a unified cloud interface (UCI) and the second via Enterprise Cloud Orchestration [Par09].

A unified cloud interface is basically an API that is written ‚Äúaround‚ÄĚ other cloud APIs that is vendor specific. It requires some re-writing of and integration. This is similar to the approach selected by Apache jClouds and Apache libcloud.

The Enterprise Cloud Orchestration is a layer where different cloud providers register their services. The platform then provides these services for users like in a discovery. This is very similar to UDDI. The downside of this is that the orchestration layer still needs to integrate all different services (and built wrappers around them). However, it is transparent to the end-user.

Enterprise orchestration layer

Figure: Enterprise Orchestration Layer [Par09]

This post is part of a work done on Cloud interoperability. You can access the full work here and the list of references here.

Discussion of existing standards and frameworks for Cloud Interoperability

As discussed in the previous sections, there are several standards and interoperability frameworks available. Most of them are infrastructure related. The standards and frameworks can generally be clustered into 3 groups.

The first group is the ‚ÄúStandards‚ÄĚ group, which consists of OCCI an the DMTF standards. The second group is the ‚ÄúMiddleware‚ÄĚ group. This group contains mOSAIC, the PaaS Semantic Interoperability Framework and Frascati. The third group is the ‚ÄúLibrary‚ÄĚ group. This group is a concrete implementation that provides a common API for several cloud platforms. The two projects in here are Apache jClouds and Apache libcloud.

Interoperability Solutions for the Cloud
Interoperability Solutions for the Cloud

Figure: Interoperability in the Cloud

OCCI provides great capabilities for infrastructure solutions, but there is nothing done for individual services. The same applies to the standards proposed by the distributed management task force.

As with the libraries and frameworks, a similar picture is drawn. Apache jClouds and Apache libcloud provide some interoperability features for infrastructure services. As for platform services, only the blob storage is available. When developers build their applications, they still run into interoperability challenges.

mOSAIC offers a large number of languages and services, however, it is necessary to build a layer on top of an existing platform. This is not a lightweight solution, as a developer has to maintain both ‚Äď the developed application and the middleware solution installed on top of the provider. The developer eliminates the vendor lock-in, but runs into operational management of the mOSAIC platform. This eliminates a problem on the one side but might create a new one on the other side.

The same problem exists with the PaaS Semantic Interoperability Framework. A user has to install a software on top of the cloud platforms, which then has to be maintained by the user. The goal of platform as a service is to relieve the user from any operational management like in IaaS platforms. Frascati is also a middleware that needs to be maintained.

All of the libraries or frameworks described work for IaaS services. None of them support the PaaS paradigm and services related to it. Apache libcloud and Apache jClouds offer some very fundamental support for the storage service. However, other services such as messaging and key/value storage are not supported.

This post is part of a work done on Cloud interoperability. You can access the full work here and the list of references here.

PaaS Semantic Interoperability Framework (PSIF)

PaaS Semantic Interoperability Framework (PSIF)

Loutas et al. defines semantic interoperability as ‚Äúthe ability of heterogeneous Cloud PaaS systems and their offerings to overcome the semantic incompatibilities and communicate‚ÄĚ [Lou11]. The target of this framework is to give developers the ability to move their application(s) and data seamlessly from one provider to another. Loutas et al. propose a three-dimensional model addressing semantic interoperability for public cloud solutions [Lou11].

Fundamental PaaS Entities

The fundamental PaaS entities consist of several models: the PaaS System, the PaaS Offering, an IaaS-Offering, Software Components and an Application [Lou11].

Levels of Semantic Conflicts

Loutas et al. [Lou11] assumes that there are 3 major semantic conflicts that can be raised for PaaS offerings. The first one is an interoperability problem between the metadata definitions. This occurs when different data models describe one PaaS offering. The second problem is when the same data gets interpreted differently and the third is when different pieces of data have similar meaning. Therefore, [Lou11] uses a two-level approach for solving semantic conflicts. The first level is the Information Model that refers to differences with data and data structures/models. The other level is the data level that refers to differences in the data because of various representations.

Types of Semantics

Three different types of semantics are defined [Lou11]. The first type is the functional semantic. This is basically a representation of everything that a PaaS solution can offer. The second type, the non-functional semantic, is about elements such as pricing or Quality of Service. The third semantic is the execution semantic, which is describing runtime semantics.

This post is part of a work done on Cloud interoperability. You can access the full work here and the list of references here.

Interoperability in the cloud: mOSAIC and Frascati


mOSAIC is a European project supported by the European union [Dan10]. The target of the project was to build a unified Application programing interface (API) for Cloud services that is not only available in Java but also for other languages. mOSAIC is platform and language agnostic. It supports a large number of platforms for this approach. The mOSAIC framework itself is a middleware that runs on top of each cloud provider and abstracts provider specifics. The platform then exposes it’s own API to clients.

The mOSAIC project is built in a layered architecture. On the lowest level, there is the native API or protocol. This is either a ReST, SOAP, RPC or a language-specific library. On the next level, a driver API is found. This API can now be exchanged easily with different platforms such as Amazon’s S3. On top of that is an interoperability-API that allows programming language interoperability. Cloud resources can be access via the connector API. This is also the entry-point for developers, as they access specific models and resources from that API. On top of the connector API is a cloudlet that provides a cloud compliant programming methodology.

Frascati-based Multi PaaS Solution

Frascati is an infrastructure solution that runs on 11 different cloud solutions: Amazon EC2, Amazon Elastic Beanstalk, BitNami, CloudBees, Cloud Foundry, Dot-Cloud, Google App Engine, Heroku, InstaCompute, Jelastic, and OpenShift [Par12].

Frascati follows 3 core principles: An open service model, a configurable Multi-PaaS infrastructure and several infrastructure services. The open service model is an assembly of loosely coupled services based on a Service oriented Architecture. The configurable federated Multi-PaaS Infrastructure is a configurable kernel. Infrastructure services take care of the node provisioning, the deployment of the PaaS service, the deployment of the SaaS service and a federation management service. The Frascati services are installed on top of existing IaaS-services. To work with these services, it is necessary to have access to virtual machines (either Linux or Windows).

This post is part of a work done on Cloud interoperability. You can access the full work here and the list of references here.

Interoperability libraries in the cloud: Apache jClouds and Libcloud

Apache jClouds

Apache jclouds is a framework provided by the Apache Software Foundation. The framework is written in Java and serves the purpose to provide an independent library for typical cloud operations. At present (November 2014), Apache jclouds provides 2 kinds of services: a compute service and a blob service [Apa14b]. Apache jclouds can be used from Java and Clojure. The library offers an abstraction for more than 30 cloud providers, including AWS, Azure, OpenStack and Rackspace.

Apache jclouds is primarily built for infrastructure interoperability. As for the platform layer, only blob storage is currently supported. The focus of jclouds is to support a large variety of platforms over implementing a large variety of services. The Blob storage in Apache jclouds works with the concept of Containers, Folders and Blobs. The library supports access control lists for objects. The upload of Multipart objects is also supported, which allows jclouds to handle large files. [Apa14c]


[Apa14d] Apache libcloud is similar to Apache jClouds. The library is developed as an abstraction to different cloud providers. Libcloud currently supports more than 30 different providers. Libcloud is available as a Python library.

Libcloud is infrastructure-focused with most of it’s implementations done for computing, DNS and Load balancing. There is also an object storage implementation available.

This post is part of a work done on Cloud interoperability. You can access the full work here and the list of references here.