International Data Science Conference, Salzburg


Hi,

I am happy to share this exciting conference, where I will be delivering a keynote. Mike Olson from Cloudera will also deliver a keynote at the conference.

About the conference:

June 12th – 13th 2017 | Salzburg, Austria | www.idsc.at

The 1st International Data Science Conference (iDSC 2017), organized by Salzburg University of Applied Sciences (Information Technology and Systems Management) in cooperation with Information Professionals GmbH, seeks to establish a key Data Science event that provides a forum for international exchange on Data Science technologies and applications.

The International Data Science Conference gives participants the opportunity, over the course of two days, to delve into the most current research and up-to-date practice in Data Science and data-driven business. Besides the two parallel tracks, the Research Track and the Industry Track, a Symposium on the second day will present the outcomes of a European project on Text and Data Mining (TDM). These events are open to all participants.

We are also proud to announce keynote presentations from Mike Olson (Chief Strategy Officer, Cloudera), Ralf Klinkenberg (General Manager, RapidMiner), Euro Beinat (Data Science Professor and Managing Director, CS Research) and Mario Meir-Huber (Big Data Architect, Microsoft). The keynotes will be distributed over both conference days, providing time for all participants to come together and share views on challenges and trends in Data Science.

The Research Track offers a series of short presentations from Data Science researchers on their own current papers. On both conference days, we are planning a morning and an afternoon session presenting the results of innovative research into data mining, machine learning, data management and the entire spectrum of Data Science.

The Industry Track showcases practitioners of data-driven business and how they use Data Science to help achieve organizational goals. Though not restricted to these topics, the industry talks will concentrate on our broad focus areas of manufacturing, retail and social good. Users of data technologies can meet with peers and exchange ideas and solutions to the practical challenges of data-driven business.

Furthermore, the Symposium is organized in collaboration with the FutureTDM Consortium. FutureTDM is a European project which, over the last two years, has been identifying the legal and technical barriers, as well as the skills gaps among stakeholders and practitioners, that inhibit the uptake of text and data mining by researchers and innovative businesses. The Symposium will focus on the recommendations and guidelines proposed to counterbalance these barriers, ensure broader TDM uptake and thus boost Europe’s research and innovation capacities.

Our sponsors, among them Cloudera, will have their own special platform: half-day workshops to provide hands-on interaction with tools or to learn approaches to developing concrete solutions. In addition, there will be an exhibition of the sponsors’ products and services throughout the conference, giving participants the opportunity to seek contact and advice.

The iDSC 2017 is therefore a unique meeting place for researchers, business managers, and data scientists to discover novel approaches and to share solutions to the challenges of a data-driven world.

Why building Hadoop on your own doesn’t make sense


There are several recurring discussions when it comes to Hadoop, and some of them miss the point. First, there is a small number of people who believe that Hadoop is a hype that will end at some point in time. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. Beyond that, two major claims keep coming up: one group states that Hadoop is cheap because it is open source, and the other states that Hadoop is expensive because it is very complicated. (Note: by Hadoop, I also include Spark and similar technologies.)

Neither the one nor the other is true.

First, the claim that Hadoop is cheap: you can download it for free and install it on your systems. That makes it basically free in terms of licenses, but not in terms of running it. With a vanilla Hadoop installation, you have to take care of hotfixes, updates, services, integration and many more tasks that quickly become very complicated. This ends up with you spending a lot of money on Hadoop experts to solve your problems. Remember: you haven’t solved any business problem or question so far, because you are busy just running the system! You spend dollar after dollar on expensive operational topics instead of spending them on creating value for your business.

Now for the opposite claim: Hadoop is expensive. Is it? In the past years I have seen a lot of Hadoop projects that went more or less badly. Costs were always higher than expected, and the project timeframe was never kept. Hadoop experts earn high incomes as well, which makes consulting hours even more expensive. Plus, you probably won’t find them on the market, as they can pick which projects they take. So you have two major problems: high implementation cost and low resource availability.

Another factor that is relevant to the cost discussion is cluster utilization. In many projects I have seen the same trend: when the discussion about cluster sizing comes up, there are two main options: (a) sizing the cluster for the highest expected utilization or (b) making the cluster smaller than the highest expected utilization. If you select (a), you have another problem: the cluster might be under-utilized. What I often see with my clients is the following: 20% of the time they have full utilization on the cluster, but 80% of the time the cluster utilization is below 20%. This basically means that your cluster becomes very expensive when it comes to the business case calculation. If you select (b), you lose business agility and your projects and analytics might require long compute times.
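
To make the business case point behind option (a) concrete, here is a small back-of-the-envelope sketch in Java. All numbers (monthly cluster cost, hours per month, utilization split) are invented for illustration and do not come from a real project:

    public class ClusterUtilizationSketch {
        public static void main(String[] args) {
            // Hypothetical figures: a cluster billed at a flat monthly rate.
            double monthlyClusterCost = 40000.0; // assumed total cost per month
            double hoursPerMonth = 730.0;

            // The pattern described above: 20% of the time the cluster runs at
            // ~100% load, 80% of the time it runs below ~20% load.
            double effectiveUtilization = 0.20 * 1.00 + 0.80 * 0.20; // = 0.36

            double costPerHour = monthlyClusterCost / hoursPerMonth;
            double costPerUtilizedHour = costPerHour / effectiveUtilization;

            System.out.printf("Nominal cost per hour:            %.2f%n", costPerHour);
            System.out.printf("Effective cost per utilized hour: %.2f%n", costPerUtilizedHour);
            // With these assumed numbers, every hour of actual work costs roughly
            // 2.8 times what a fully utilized cluster would suggest.
        }
    }

Whatever the real numbers look like, it is this gap between the nominal and the effective hourly cost that makes an over-sized cluster so hard to justify.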

At the beginning of this article, I promised to show that Hadoop can still be cost-effective. So far I have only argued that it can get expensive, which would suggest the opposite. Hadoop is still cost-effective, but the solution is the topic of my next blog post, so stay tuned 😉

My Big Data predictions for 2016


With 2016 around the corner, the question is what the new year will bring for Big Data. Here are my top predictions for the year to come:

  • The growth of relational databases will slow down, as more companies evaluate Hadoop as an alternative to classic RDBMS.
  • The Hadoop stack will get more complicated as more and more projects are added. It will almost take a team just to understand what each of these projects does.
  • Spark will lead the market for handling data. It will change the entire ecosystem again.
  • Cloud vendors will add more and more capabilities to their solutions to deal with the increasing demand for workloads in the cloud.
  • We will see a dramatic increase in successful use cases with Hadoop, as the first projects come to a successful end.

What do you think about my predictions? Do you agree or disagree?

Big Data and Hadoop E-Books at reduced price


Two Big Data and Hadoop e-books are available at a special promotional price. The reduced price is only valid for one week, so make sure to order soon! The offer expires on the 21st of December, and both e-books are available in the Kindle store. The two e-books are:

  • Big Data (Introduction); $0.99 instead of $5: Get it here
  • Hadoop (Introduction); $0.99 instead of $5: Get it here

Have fun reading them!

What everyone is doing wrong about Big Data


Over the last months, I have seen so many Big Data “initiatives” in companies. And guess what? Most of them either failed completely or simply didn’t deliver the expected results. A recent Gartner study even mentioned that only 20% of Hadoop projects are put “live”. But why do these projects fail? What is everyone doing wrong?

Whenever customers come to me, they have “heard” what Big Data can help them with. They looked at one to three use cases and now want to put them into production. However, this is where the problem starts: they are not aware that Big Data, too, needs a strategic approach. To get this right, it is necessary to understand the industry (e.g. telco, banking, …) and its associated opportunities. To achieve that, a Big Data roadmap has to be built, normally in a couple of workshops with the business. This roadmap then outlines which projects are done in which priority and how results are measured. For this purpose, we have a Business Value Framework for different industries, in which possible projects are defined.

The other thing I often see is that customers come and say: so now we have built a data lake. What should we do with it? We simply can’t find value in our data. This is the wrong approach. We talk a lot about the data lake, but it is not as easy as IT marketing tells us; whenever you build a data lake, you first have to think about what you want to do with it. How should you know what you might find if you don’t really know what you are looking for? Ever tried searching for “something”? If you have no strategy, the data lake is worth nothing and you will find nothing. A data lake does make sense, but you need to know what you want to build on top of it. Building a data lake for Big Data without a strategy is like buying bricks for a house without knowing where you are going to construct the house and what it should finally look like. Still, a data lake is necessary to provide great analytics and to run projects on top of it.

Big Data and IT Business alignment

Summing it up, what is necessary for Big Data is to have a clear strategy and vision in place. If you fail to do so, you will end up like many others – being desperate about the promises that didn’t turn out to be true.

Amazon Web Services S3 upload bug


The AWS Java SDK version 1.8.10 comes with a critical bug affecting uploads. A fix was provided by AWS, and normally the SDK is updated automatically, so you don’t need to worry.

However, if automatic updates are disabled in your Eclipse installation, you might lose data when uploading via SDK version 1.8.10. Here is what AWS has to say about the bug:

AWS Message:
Users of AWS SDK for Java 1.8.10 are urged to immediately update to the latest version of the SDK, version 1.8.11.

If you’ve already upgraded to 1.8.11, you can safely ignore this message.

Version 1.8.10 has a potential for data loss when uploading data to Amazon S3 under certain conditions. Data loss can occur if an upload request using an InputStream with no user-specified content-length fails and is automatically retried by the SDK.

The latest version of the AWS SDK for Java can be downloaded here:
http://aws.amazon.com/sdk-for-java/

And is also available through Maven central:
http://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.8.11/

The bug itself has been fixed. If you haven’t updated the AWS SDK yet and are still on version 1.8.10, you should update now; normally, the AWS SDK updates itself automatically in Eclipse.
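
For illustration, here is a minimal sketch of an S3 upload that sidesteps the condition described in the AWS notice by setting the content length explicitly (bucket name, key and file are placeholders I made up; the real fix remains upgrading to 1.8.11 or later):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    public class S3UploadSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder bucket, key and file names for illustration only.
            String bucket = "my-example-bucket";
            String key = "backups/data.bin";
            File file = new File("data.bin");

            // Credentials are resolved through the SDK's default provider chain.
            AmazonS3 s3 = new AmazonS3Client();

            try (InputStream in = new FileInputStream(file)) {
                // The data-loss condition involves InputStream uploads without a
                // user-specified content length; setting it explicitly avoids that path.
                ObjectMetadata metadata = new ObjectMetadata();
                metadata.setContentLength(file.length());
                s3.putObject(bucket, key, in, metadata);
            }
        }
    }

Because the content length is known up front, a failed request does not fall into the stream-retry scenario the notice describes.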