Big Data and Hadoop E-Books at reduced price


Two Big Data and Hadoop e-books are available at a special promotional price. The reduced price is valid for one week only, so make sure to order soon! The offer expires on the 21st of December, and the books are available on the Kindle store. The two e-books are:

  • Big Data (Introduction); $0.99 instead of $5: Get it here
  • Hadoop (Introduction); $0.99 instead of $5: Get it here

Have fun reading them!

My Cloud predictions for 2016


2016 is around the corner, and the question is what the next year might bring. Here are my top five predictions for 2016:

  • The Cloud war will intensify. Amazon and Azure will lead the space, followed (at quite some distance) by IBM. Google and Oracle will stay far behind the leading 2+1 Cloud providers. Both Microsoft and Amazon will see significant growth, with Microsoft’s growth being higher, meaning that Microsoft will continue to catch up with Amazon.
  • More PaaS solutions will arrive. All major vendors will provide PaaS solutions on their platforms for different use cases (e.g. Internet of Things). These solutions will become more industry-specific (e.g. a solution specific to manufacturing workflows, …).
  • Vendors currently not using the Cloud will see declines in their income as more and more companies move to the Cloud.
  • Cloud data centers will more often be outsourced by the leading providers to local companies in order to deal with local legislation.
  • Big Data in the Cloud will grow significantly in 2016 as more companies move workloads for these kinds of applications to the Cloud.

What do you think? What are your predictions?

RACEing to agile Big Data Analytics


I am happy to announce the development work we did over the last month within Teradata. We developed a lightweight process model for Big Data Analytics projects, which is called “RACE”. The model is agile and reflects the know-how of more than 25 consultants who worked on over 50 Big Data Analytics projects in recent months. Teradata was also one of the companies behind CRISP-DM, the industry-leading process for data mining. Now we have created a new process for agile projects that addresses the new challenges of Big Data Analytics.

Where does the ROI come from?

This was one of the key questions we addressed when developing RACE. The economics of Big Data Discovery Analytics are different from traditional Integrated Data Warehousing economics. ROI comes from discovering insights in highly iterative projects run over very short time periods (usually 4 to 8 weeks). Each meaningful insight or successful use case that can be actioned generates ROI. The total ROI is the sum of all the successful use cases. Competitive advantage is therefore driven by the capability to produce both a high volume of insights and creative insights that generate a high ROI.

What is the purpose of RACE?

RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge and creativity to produce high-ROI insights.

What does the process look like?

RACE - an agile process for Big Data Analytic Projects

The process itself is divided into several short phases:

  • Roadmap. An optional (but strongly recommended) first step to build a roadmap of where the customer wants to go in terms of Big Data.
  • Align. Use cases are detailed and data is confirmed.
  • Create. Data is loaded, prepared and analyzed. Models are developed.
  • Evaluate. Recommendations for the business are given.

In the next couple of weeks we will publish much more on RACE, so stay tuned!

Amazon's Cloud Business growing fast, creating surplus


Amazon announced details about their Q2 earnings yesterday. Their cloud business grew by an incredible 81%. This is massive, given that Amazon is already the number one company in that area. This quarter, they earned 1.8 billion USD from cloud computing.

Extrapolating this number over four quarters, their revenue would definitely reach some 7 billion USD this year. However, if growth continues at this pace, I guess they could even reach double digits by the end of this year. Will Amazon reach 10 billion in 2015? If so, this would be incredible! Microsoft stated that their growth was somewhere well above the 100% mark, so I am interested in where Microsoft will stand by the end of the year.
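The back-of-the-envelope arithmetic behind this estimate can be sketched as follows. Only the 1.8 billion USD quarterly figure comes from the earnings report; the function names are made up for illustration, and the quarter-over-quarter growth rate in the last example is a purely hypothetical value, not a reported number.

```python
def flat_run_rate(quarterly_revenue, quarters=4):
    """Annual estimate assuming every quarter matches the given one."""
    return quarterly_revenue * quarters


def compounding_total(first_quarter_revenue, quarterly_growth, quarters=4):
    """Total revenue when each quarter grows by quarterly_growth over the previous one."""
    total = 0.0
    revenue = first_quarter_revenue
    for _ in range(quarters):
        total += revenue
        revenue *= 1 + quarterly_growth
    return total


if __name__ == "__main__":
    q2_revenue = 1.8  # billion USD, as stated in the post

    # Four quarters at the Q2 level give the "some 7 billion" estimate.
    print(round(flat_run_rate(q2_revenue), 2))

    # Hypothetical: 10% growth per quarter, starting from the Q2 level.
    print(round(compounding_total(q2_revenue, 0.10), 2))
```

The flat estimate is a lower bound if the business keeps growing; any compounding scenario with positive growth lands above it.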

But what does this tell us? Both Microsoft and Amazon are growing fast in this business, and we can expect to see many more interesting services in the Cloud in the coming months and years. My opinion is that the market is already consolidated between Microsoft and Amazon. Other companies such as Google and Oracle are rather niche players in the Cloud market.

Hadoop Tutorial – Getting started with Apache Hadoop


Hadoop is one of the most popular Big Data technologies, or maybe even the key Big Data technology. Due to the large demand for Hadoop, I’ve decided to write a short Hadoop tutorial series here. Over the next few weeks, I will write several articles on the Hadoop platform and its key technologies.

When we talk about Hadoop, we don’t talk about one specific piece of software or a service. The Hadoop project comprises several sub-projects, each of them covering a different topic in the Big Data ecosystem. When handling data, Hadoop is very different from a traditional RDBMS. Key differences are:

  • Hadoop is about large amounts of data. Traditional database systems typically handled gigabytes or terabytes of data; Hadoop can handle much more. Petabytes are not a problem for Hadoop.
  • An RDBMS offers interactive access to data, whereas Hadoop is batch-oriented.
  • With traditional database systems, the approach was “read many, write many”. That means that data gets written often but also modified often. With Hadoop, this is different: the approach now is “write once, read many”. This means that data is written once and then never gets changed. The only purpose is to read the data for analytics.
  • An RDBMS has schemas. When you design an application, you first need to create the schema of the database. With Hadoop, this is different: the schema is very flexible; it is effectively schema-less.
  • Last but not least, Hadoop scales linearly. If you add 10% more compute capacity, you will get about 10% more performance. RDBMS are different; at a certain point, scaling them gets really difficult.

Central to Hadoop is the Map/Reduce algorithm. This algorithm was originally introduced by Google to power their search engine. However, the algorithm turned out to be very efficient for distributed systems, so it is nowadays used in many technologies. When you run queries in Hadoop with languages such as Hive or Pig (I will explain them later), these queries are translated to Map/Reduce jobs by Hadoop. The following figure shows the Map/Reduce algorithm:

Map Reduce function

Map/Reduce consists of the following steps:

  1. All input data is distributed to the Map functions.
  2. The Map functions run in parallel. Distribution and failover are handled entirely by Hadoop.
  3. The Map functions emit data to a temporary storage.
  4. The Reduce function then aggregates the temporarily stored data.

A typical example is word count. With word count, the input text is passed to the Map functions. The Map function adds all words of the same kind to a list in the temporary store. The Reduce function then counts the words and builds a sum.
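The word-count flow can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hadoop code: the function names (`map_words`, `group_by_word`, `reduce_counts`, `word_count`) are made up for this example, the in-memory dictionary stands in for the temporary store, and the parallel execution and failover that Hadoop provides are left out.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def group_by_word(pairs):
    """Stand-in for the temporary store: collect the emitted values per word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the list of values collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    emitted = []
    for line in lines:  # Hadoop would run these Map calls in parallel
        emitted.extend(map_words(line))
    return reduce_counts(group_by_word(emitted))


if __name__ == "__main__":
    counts = word_count(["big data is big", "hadoop handles big data"])
    print(counts)
```

Because each Map call only looks at its own line, the lines can be processed on different machines independently; only the grouping step needs to bring values for the same word together.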

Next week I will blog about the different Hadoop projects. As mentioned earlier, Hadoop consists of several sub-projects.

Privacy killed the Big Data star


Big Data is all about limiting our privacy. With Big Data, we get no privacy at all. Hello, Big Brother is watching us and we have to stop it right now!

Well, this is far too harsh. Big Data is NOT all about limiting our privacy. Just to make it clear: I see the benefits of Big Data. However, there are a lot of people out there who are afraid of Big Data because of privacy. The thing I want to state first: Big Data is not the NSA, Facebook or whatever surveillance technology you can think of. Of course, surveillance is often enabled by Big Data technologies. I see this discussion often, and I recently came across an event that stated that Big Data is bad and limits our privacy. I say, this is bullsh##.

The event I am talking about stated that Big Data is bad, that it limits our privacy and that it needs to be stopped. It is a statement that only sees one side of the topic. I agree that the continuous monitoring of people by secret services isn’t great and we need to do something about it. But this is not Big Data. I agree that Facebook is limiting my privacy. I have significantly reduced the amount of time I spend on Facebook and don’t use the mobile apps. This needs to change.

However, this is not Big Data. These are companies/organisations doing something that is not OK. Big Data is much more than that. Big Data is not just evil; it is great in many respects:

  • Big Data in healthcare can save thousands, if not millions, of lives by improving medicine and vaccination and by finding correlations for chronically ill people to improve their treatment. Nowadays, we can decode DNA in a short time, which helps a lot of people!
  • Big Data in agriculture can improve how we produce food. Since the global population is growing, we need to become more productive in order to feed everyone.
  • Big Data can improve the stability and reliability of IT systems by providing real-time analytics. Logs are analysed in real time to react to incidents before they happen.
  • Big Data can, and actually does, improve the reliability of devices and machines. One example is medical devices. A company in this field was able to reduce the downtime of its devices from weeks to only hours! This does not just save money, it also saves lives!
  • There are many other use cases where Big Data is great.

We need to start working together instead of just calling something bad because it seems to be so. No technology is good or evil in itself; there are always some bad things but also some good things. It is necessary to see all sides of a technology. The conference I was talking about inspired me to write this article because its view is so small-minded.

Future Technologies that have an impact on Cloud and Big Data


As for future technologies, Cloud Computing and Big Data aren’t the future anymore. They are here right now, and more and more of us are starting to deal with these technologies. Even when you watch TV, a reference to the cloud is often made. But there are several other technologies that will have a certain impact on Cloud Computing and Big Data. These technologies are different from Cloud and Big Data but will build on them as an important basis and back end.

Future Emerging Technologies using Cloud and Big Data

The technologies are:

  • Smart Cities
  • Smart Homes
  • Smart Production
  • Autonomous Systems
  • Smart Logistics
  • Internet of Things

All these technologies work together and have the Cloud as their back end. Furthermore, they use Big Data concepts and technologies. Taken together, these technologies can be described as “cyber-physical systems”. This basically means that the virtual world we were used to until now moves further into the physical world. These two worlds will merge and form something totally new. In the upcoming weeks I will outline each topic in detail, so stay tuned and subscribe to this tag to get the updates.

Header Image Copyright by Pascal, licensed under the Creative Commons 2.0 license.