Why building Hadoop on your own doesn’t make sense


There are several discussions going on when it comes to Hadoop, and some of them miss the point. First, there is a small number of people who believe Hadoop is a hype that will end at some point. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. Then there are two other groups with opposing claims: the first states that Hadoop is cheap because it is open source, and the second states that Hadoop is expensive because it is very complicated. (Note: by Hadoop, I also include Spark and the like.)

Neither the one nor the other is true.

First, you can download it for free and install it on your system. This makes it basically free in terms of licenses, but not in terms of running it. When you get a vanilla Hadoop, you will have to think about hotfixes, updates, services, integration and many more tasks that get very complicated. You end up spending a lot of money on Hadoop experts to solve your problems. Remember: you haven’t solved any business problem or question so far, as you are busy running the system! You spend dollars and dollars on expensive operational topics instead of spending them on creating value for your business.

Now, we have the opposite claim: Hadoop is expensive. Is it? In the past years I saw a lot of Hadoop projects that went more or less badly. Costs were always higher than expected and the project timeframe was never kept. Hadoop experts have a high income as well, which makes consulting hours even more expensive. Plus: you probably won’t find them on the market, as they can select which projects to take. So you have two major problems: high implementation cost and low resource availability.

Another factor that is relevant to the cost discussion is cluster utilization. In many projects I could see one trend: when the discussion about cluster sizing comes up, there are two main options: (a) sizing the cluster for the highest expected utilization or (b) making the cluster smaller than the highest expected utilization. If you select (a), you have another problem: the cluster will be under-utilized most of the time. What I could see, and what my clients often have, is the following: 20% of the time they have full utilization on the cluster, but 80% of the time the cluster utilization is below 20%. This basically means that your cluster is very expensive when it comes to the business case calculation. If you select (b), you will lose business agility and your projects/analytics might require long compute times.
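
To make that business case point concrete, here is a minimal back-of-the-envelope sketch in Python. The utilization split is the 20/80 pattern described above; the monthly cluster cost is a made-up placeholder, so treat the numbers as purely illustrative.

```python
# Back-of-the-envelope cluster cost sketch -- numbers are illustrative only.
# Assumption: full utilization 20% of the time, ~20% utilization the rest.

monthly_cluster_cost = 50_000.0  # hypothetical total cost of ownership per month

peak_share, peak_util = 0.20, 1.00
offpeak_share, offpeak_util = 0.80, 0.20

# Average utilization over the month: 0.2 * 1.0 + 0.8 * 0.2 = 0.36
avg_util = peak_share * peak_util + offpeak_share * offpeak_util

# What a fully utilized cluster-month effectively costs you
effective_cost = monthly_cluster_cost / avg_util

print(f"Average utilization: {avg_util:.0%}")                                # ~36%
print(f"Effective cost per utilized cluster-month: {effective_cost:,.0f}")   # ~138,889
```

In other words, with that utilization profile you are effectively paying almost three times the nominal price for every fully used cluster-month – which is exactly why the business case looks so expensive.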

At the beginning of this article, I promised to explain why Hadoop is still cost-effective. So far, I have only stated that it might be expensive, which would suggest it isn’t. Hadoop is still cost-effective, but I will present a solution in my next blog post, so stay tuned 😉

RACEing to agile Big Data Analytics


I am happy to announce the work we did over the last months within Teradata. We developed a light-weight process model for Big Data Analytics projects called “RACE”. The model is agile and brings together the know-how of more than 25 consultants who worked on over 50 Big Data Analytics projects in recent months. Teradata also helped develop CRISP-DM, the industry-leading process for data mining. Now we have created a new process for agile projects that addresses the new challenges of Big Data Analytics.

Where does the ROI come from?

This was one of the key questions we addressed when developing RACE. The economics of Big Data Discovery Analytics are different from traditional Integrated Data Warehousing economics. ROI comes from discovering insights in highly iterative projects run over very short time periods (usually 4 to 8 weeks). Each meaningful insight or successful use case that can be actioned generates ROI, and the total ROI is the sum of all successful use cases. Competitive advantage is therefore driven by the capability to produce both a high volume of insights and creative insights that generate a high ROI.

What is the purpose of RACE?

RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge and creativity to produce high-ROI insights.

What does the process look like?

RACE – an agile process for Big Data Analytic Projects

The process itself is divided into several short phases:

  • Roadmap. That’s an optional first step (but highly recommended) to build a roadmap of where the customer wants to go in terms of Big Data.
  • Align. Use cases are detailed and data is confirmed.
  • Create. Data is loaded, prepared and analyzed. Models are developed.
  • Evaluate. Recommendations for the business are given.

In the next couple of weeks we will publish much more on RACE, so stay tuned!

What everyone is doing wrong about Big Data


I have seen so many Big Data “initiatives” in companies in the last months. And guess what? Most of them failed either completely or simply didn’t deliver the expected results. A recent Gartner study even mentioned that only 20% of Hadoop projects are put “live”. But why do these projects fail? What is everyone doing wrong?

Whenever customers come to me, they have “heard” what Big Data can help them with. So they looked at one to three use cases and now want to put them into production. However, this is where the problem starts: they are not aware that Big Data, too, needs a strategic approach. To get this right, it is necessary to understand the industry (e.g. TelCo, Banking, …) and the associated opportunities. To achieve that, a Big Data roadmap has to be built. This is normally done in a couple of workshops with the business. This roadmap then outlines which projects are done in what priority and how to measure results. For this purpose, we have a Business Value Framework for different industries, in which possible projects are defined.

The other thing I often see is that customers come and say: so now we have built a data lake. What should we do with it? We simply can’t find value in our data. This is a totally wrong approach. We often talk about the data lake, but it is not as easy as IT marketing tells us; whenever you build a data lake, you first have to think about what you want to do with it. How should you know what you might find if you don’t really know what you are looking for? Ever tried searching for “something”? If you have no strategy, the lake is worth nothing and you will find nothing. A data lake does make sense, but you need to know what you want to build on top of it. Building a data lake for Big Data is like buying bricks for a house – without knowing where you are going to build that house and without knowing what the house should finally look like. That said, a data lake is necessary to provide great analytics and to run projects on top of it.

Big Data and IT Business alignment

 

Summing it up, what is necessary for Big Data is to have a clear strategy and vision in place. If you fail to do so, you will end up like many others – frustrated by promises that didn’t turn out to be true.

 

How to kill your Big Data initiative


Everyone is doing Big Data these days. If you don’t work on Big Data projects within your company, you are simply not up to date and don’t know how things work. Big Data solves all of your problems, really!

Well, in reality this is different. It doesn’t solve all your problems. It actually creates more problems than you think! Most companies I have seen working on Big Data projects recently failed. They started a Big Data project and successfully wasted thousands of dollars on it. But what exactly went wrong?

First of all, Big Data is often equated with Hadoop. We live with the misperception that Hadoop alone can solve all Big Data topics. This simply isn’t true. Hadoop can do many things – but real data science is often not done with the core of Hadoop. Ever talked to someone doing the analytics (e.g. someone with a strong math or statistics background)? They are not OK with writing Java MapReduce jobs or Pig/Hive scripts. They want to work with tools that are way more interactive.
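
To illustrate what “more interactive” means: instead of writing a Java MapReduce job or a Pig/Hive script, submitting it to the cluster and waiting for the batch result, an analyst typically wants to poke at a sample of the data step by step and get immediate feedback. A minimal sketch of that working style in Python with pandas (the file name and columns are hypothetical, just to show the idea):

```python
# Minimal interactive exploration sketch with pandas.
# The file and column names are made up for illustration.
import pandas as pd

# Load a sample that was extracted from the cluster
df = pd.read_csv("call_records_sample.csv")

# Immediate feedback after every step -- no batch job to compile, submit and wait for
print(df.describe())                               # quick summary statistics
print(df.groupby("region")["duration"].mean())     # ad-hoc aggregation
print(df[["duration", "dropped_calls"]].corr())    # check a correlation on the spot
```

Whether it is pandas, R or a notebook on top of Spark is secondary; the point is the fast, iterative loop that batch-oriented MapReduce scripts don’t offer.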

The other thing is that most Big Data initiatives are handled wrong. Most of them simply don’t include someone who is good at analytics. You don’t usually find this type of person in an IT team – the person has to be found somewhere else. Failing to include someone with these skills often leads to finding “nothing” in the data – because IT staff are good at writing queries, but not at doing complex analytics. These skills are not taught in IT classes – it takes a totally different field of study to reach this skill set.

Hadoop is seen as the solution to everything by many IT departments. However, projects often stop with implementing Hadoop. Most Hadoop implementations never leave the pilot phase. This is often because IT departments see Hadoop as a fun thing to play with – but getting it into production requires a different approach. There are actually more technologies out there than Hadoop that can be used when delivering a Big Data project.

A sure way to ruin your Big Data project is not involving the LoB. The IT department often doesn’t know what questions to ask, so they end up finding answers in the data and then trying to work out what the question was. The LoB sees it differently: when they see an answer, they know which business question it belongs to.

The key to killing your Big Data initiative is exactly one thing: go with the hype. Implement Hadoop and don’t think about what you actually want to achieve with it. Forget the use case, just go and play with the fancy technology. NOT!

As long as companies stick to that approach, I am sure I will have enough work to do. I have “inherited” several failed projects and turned them into successes. So, please carry on.

Why Big Data projects are challenging – and why I love it


During my professional career, I have managed several IT projects, mainly in the distributed systems environment. Initially, these were cloud projects, which were rather straightforward. I worked with IT departments in different domains/industries and we all shared the same “vocabulary”. When talking with IT staff, everyone uses the same terms to describe things; no special explanation is needed.

I soon realized that Big Data projects are VERY different from that. I have written several posts on Big Data challenges and on the requirements for data scientists and the like in the last months. What I keep coming across when managing Big Data projects is the different approach one has to take to manage these kinds of projects successfully.

Let me first explain what I do. First of all, I don’t code, implement or create any kind of infrastructure. I work with senior (IT) staff on ideas that will eventually be transformed into Big Data projects (either directly or indirectly). My task is to work with them on what Big Data can achieve for their organization and/or problem. I am not discussing what their Hadoop solution will look like; I am working on use cases and challenges/opportunities for their problems, independent of a concrete technology. Plus, I am not focused on any specific industry or domain.

However, all strategic Big Data projects follow a similar pattern. The most challenging part is to understand the problem. In the last months, I faced challenges in different industries; whenever I run these kinds of projects, it is mainly about cooperating with the domain experts. They often have no idea about the possibilities of Big Data – and they don’t have to. I, in contrast, have no idea about the domain itself. This is challenging on the one side – but very helpful on the other. The more experience a person gains within a specific domain, the more that person thinks and acts in the methodology of that domain. They often don’t see a solution because they work from an “I’ve made this experience before, so it has to be very similar” mindset. The same applies to me as a Big Data expert. All the workshops I ran were mainly about mixing the concrete domain with the possibilities of Big Data.

I had a number of interesting projects lately. One of them was in the geriatric care domain. We worked on how data can make the lives of the elderly better and what type of data is needed. It was very interesting to work with domain experts and see what challenges they actually face. An almost funny discussion arose around Open Data – we looked at several data sources provided by the state and I remarked: “sorry, but we can’t use these data sources. They aren’t big and they are about the locations of toilets within our capital city.” Their opinion was different, because the location of toilets is very important to them – and data doesn’t always need to be big, it needs to be valuable. Another project was in the utilities domain, where the goal was to improve the supply chain by optimizing it with data. Another project, for a company providing devices, was about improving the reliability of those devices by analyzing large amounts of log data. When a device has an outage, a service technician has to travel to the city of the outage, which takes several days to a week. I worked on reducing this time and brought in a data scientist. For the 3 major error codes we could reduce the downtime to just a few hours by finding patterns weeks before the outage occurs. However, there is still much work to be done in that area. Further projects were in the utilities and government sectors.
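
To give a flavour of what “finding patterns weeks before the outage” can look like, here is a heavily simplified sketch: it only checks which error codes are over-represented in the device logs during the weeks before known outages compared to the overall baseline. The real project relied on a data scientist and far more sophisticated models; the files, column names and window length below are hypothetical.

```python
# Highly simplified sketch -- files, columns and window are illustrative only.
import pandas as pd

logs = pd.read_csv("device_logs.csv", parse_dates=["timestamp"])    # device_id, timestamp, error_code
outages = pd.read_csv("outages.csv", parse_dates=["outage_start"])  # device_id, outage_start

window = pd.Timedelta(days=14)  # look at the two weeks before each outage

# Keep only log entries that fall into the window before an outage of the same device
pre = logs.merge(outages, on="device_id")
pre = pre[(pre["timestamp"] >= pre["outage_start"] - window) &
          (pre["timestamp"] < pre["outage_start"])]

# Compare error-code frequencies before outages against the overall baseline
before = pre["error_code"].value_counts(normalize=True)
baseline = logs["error_code"].value_counts(normalize=True)

lift = (before / baseline).sort_values(ascending=False)
print(lift.head(3))  # error codes that show up unusually often shortly before outages
```

This kind of simple frequency comparison is only a starting point, but it shows the direction: learn which signals precede an outage, then act before the technician has to travel.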

All of these projects had a common iteration phase but were very different – each project had its own challenges. The key success factor for me was how to deal with people; it was very important to work with different people from different domains with different mindsets – improving my knowledge and broadening my horizon as well. That’s challenging on the one hand but very exciting on the other.

Big Data: what or who is the data scientist?


As outlined in an earlier post here, becoming a data scientist requires a lot of knowledge.

To recap, a data scientist needs knowledge in several IT domains:

  • General understanding of distributed systems and how they work. This includes administration skills for Linux as well as hardware-related skills such as networking.
  • Knowledge of Hadoop or similar technologies. This builds on top of the former, but it is somewhat different and requires more software-focused knowledge.
  • Strong statistical/mathematical knowledge. This is necessary to actually work on the required tasks and to figure out how they can be turned into real algorithms.
  • Presentation skills. Everything else is worth nothing if someone can’t present the data or the findings in it. Management might not see the point if the person can’t present the data in an appropriate way.

In addition, there are some other skills necessary:

  • Knowledge of the legal situation. The legal basics differ from country to country. Even though the European Union sets some legal boundaries for its member states, there are still differences.
  • Knowledge of the impact on society. It is also necessary to understand how society might react to data analysis. Especially in marketing, it is absolutely necessary to handle this correctly.

Since more and more IT companies are looking for the ideal data scientist, they should first try to find out who is capable of covering all of these skills. The answer might be: there is no single person who can handle them all. It is likely that someone is great at distributed systems and Hadoop but fails at transforming questions into algorithms and finally presenting the results.

Data science is more of a team effort than a job for one person who can handle all of it. Therefore, it is necessary to build a team that is able to address all of these challenges.

Privacy killed the Big Data star


Big Data is all about limiting our privacy. With Big Data, we get no privacy at all. Hello, Big Brother is watching us and we have to stop it right now!

Well, that is far too harsh. Big Data is NOT all about limiting our privacy. Just to make it clear: I see the benefits of Big Data. However, there are a lot of people out there who are afraid of Big Data because of privacy. The first thing I want to state: Big Data is not the NSA, Facebook or whatever surveillance technology you can think of. Of course, surveillance is often enabled by Big Data technologies. I see this discussion often, and I recently came across an event that stated that Big Data is bad and limits our privacy. I say this is bullsh##.

The event I am talking about stated that Big Data is bad, that it limits our privacy and that it needs to be stopped. This statement only sees one side of the topic. I agree that the continuous monitoring of people by secret services isn’t great and that we need to do something about it. But this is not Big Data. I agree that Facebook is limiting my privacy; I have significantly reduced the amount of time I spend on Facebook and don’t use the mobile apps. This needs to change.

However, this is not Big Data. These are companies/organisations doing something that is not OK. Big Data is much more than that. Big Data is not just evil, it is great in many respects:

  • Big Data in healthcare can save thousands, if not millions, of lives by improving medicine and vaccination and by finding correlations for chronically ill people to improve their treatment. Nowadays we can decode DNA in a short time, which helps a lot of people!
  • Big Data in agriculture can improve how we produce food. Since the global population is growing, we need to become more productive in order to feed everyone.
  • Big Data can improve the stability and reliability of IT systems by providing real-time analytics. Logs are analysed in real time to react to incidents before they escalate.
  • Big Data can – and actually does – improve the reliability of devices and machines. An example is that of medical devices. A company in this field could reduce device outage time from weeks to only hours! This does not just save money, it also saves lives!
  • There are many other use cases in this field where Big Data is great.

We need to start working together instead of just calling something bad because it seems to be. No technology is good or evil; there are always some bad things but also some good things. It is necessary to see all sides of a technology. The small-minded view of the conference I was talking about gave me the inspiration to write this article.