Machine Learning 101 – Clustering, Regression and Classification


In my last post of this series, I explained the concepts of supervised, unsupervised and semi-supervised machine learning. In this post, we will go a bit deeper into machine learning (but don’t worry, it won’t be that deep yet!) and look at more concrete topics. First of all, we have to define some terms, which derive from statistics and mathematics. These are:

  • Features
  • Labels

Features are known values that are used to calculate results. These are the variables that have an impact on a prediction. If we talk about manufacturing, we might want to reduce scrap in our production line. Known features of a machine could then be: temperature, humidity, operator, time since last service. Based on these features, we can later predict the quality of the machine’s output.

Labels are the values we want to predict. In training data, labels are usually known, but for new data they are not. In the machine data example from above, the label would be the quality. All of the features together determine whether the quality is good or bad, and algorithms can then calculate the quality based on them.

Let’s now move on to another “classification” of machine learning techniques. We “cluster” them by whether they are supervised or unsupervised.

The first one is clustering. Clustering is an unsupervised technique. With clustering, the algorithm tries to find patterns in data sets that have no labels associated with them. An example is clustering customers by their buying behaviour: features for this could be household income, age, and so on, and clusters of different consumer types could then be built.
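As a small illustration, here is a minimal k-means clustering sketch in plain Python. All customer numbers are made up for this example; in a real project you would use a library such as scikit-learn instead:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """A minimal k-means sketch: split unlabeled points into k clusters."""
    random.seed(seed)
    centers = random.sample(points, k)  # pick k random points as start centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 1: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Step 2: move every center to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return centers, clusters

# Hypothetical customer features: (household income in k EUR, age).
customers = [(20, 25), (22, 30), (21, 27), (80, 55), (85, 60), (82, 58)]
centers, clusters = kmeans(customers, k=2)
```

After the run, one cluster holds the low-income/younger customers and the other the high-income/older ones, without any label ever being provided.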

The next one is classification. In contrast to clustering, classification is a supervised technique. Classification algorithms look at existing data and predict which class a new data point belongs to. Classification has been used for spam filtering for years now, and these algorithms are more or less mature at classifying a message as spam or not. With machine data, it could be used to predict material quality from several known parameters (e.g. humidity, strength, colour, …). The output of the material prediction would then be the quality class (either “good” or “bad”, or a number in a defined range such as 1–10). Another well-known example is predicting whether someone would have survived the Titanic: the classification is “true” or “false” and the input parameters are age, sex and class. If you were 55, male and in 3rd class, your chances were low; if you were 12, female and in 1st class, your chances were rather high.
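The Titanic example can be sketched with a simple k-nearest-neighbour classifier. The training rows below are made up for illustration and are not real passenger data:

```python
def knn_predict(train, query, k=3):
    """Minimal k-nearest-neighbour classification sketch: predict the label
    of a new data point from the k closest labelled training examples."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    neighbours = sorted(train, key=lambda row: dist(row[0], query))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)  # majority vote

# Hypothetical Titanic-style rows: ((age, sex: 0=male/1=female, class), survived).
train = [
    ((55, 0, 3), False), ((40, 0, 3), False), ((30, 0, 2), False),
    ((12, 1, 1), True),  ((25, 1, 1), True),  ((8, 1, 2), True),
]
print(knn_predict(train, (12, 1, 1)))  # young female, 1st class -> True
print(knn_predict(train, (55, 0, 3)))  # older male, 3rd class -> False
```

Note that in a real project the features would be scaled first, since age dominates the distance calculation here.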

The last technique for this post is regression. Regression is often confused with classification, but it is different: with regression, no discrete labels (such as good or bad, spam or not spam, …) are predicted. Instead, regression outputs continuous, often unbounded, numbers. This makes it useful for financial predictions and the like. A well-known example is the prediction of housing prices, where several values (features!) are known, such as the distance to specific landmarks, the plot size, and so on. The algorithm can then predict a price for your house and the amount you can sell it for.
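To show the difference to classification, here is a minimal regression sketch: ordinary least squares on made-up plot-size/price pairs. The output is a continuous number, not a class:

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical training data: plot size in m^2 vs. selling price in k EUR.
sizes = [300, 450, 500, 700, 900]
prices = [150, 210, 240, 320, 410]
slope, intercept = fit_line(sizes, prices)
print(round(slope * 600 + intercept))  # continuous price estimate for 600 m^2
```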

In my next post, I will talk about different algorithms that can be used for such problems.


My Big Data predictions for 2016


As 2016 is around the corner, the question is what this year will bring for Big Data. Here are my top assumptions for the year to come:

  • The growth of relational databases will slow down, as more companies evaluate Hadoop as an alternative to the classic RDBMS
  • The Hadoop stack will get more complicated, as more and more projects are added. It will almost take a team to understand what each of these projects does
  • Spark will lead the market for handling data. It will change the entire ecosystem again.
  • Cloud vendors will add more and more capability to their solutions to deal with the increasing demand for workloads in the cloud
  • We will see a dramatic increase of successful use-cases with Hadoop, as the first projects come to a successful end

What do you think about my predictions? Do you agree or disagree?

Big Data and Hadoop E-Books at reduced price


2 Big Data and Hadoop E-Books are available at a special promotion. The reduced price is only valid for 1 week, so make sure to order soon! The offer expires on the 21st of December, and the E-Books are available in the Kindle store. The two E-Books are:

  • Big Data (Introduction); $0.99 instead of $5: Get it here
  • Hadoop (Introduction); $0.99 instead of $5: Get it here

Have fun reading them!

My Cloud predictions for 2016


2016 is around the corner and the question is, what the next year might bring. I’ve added my top 5 predictions that could become relevant for 2016:

  • The Cloud war will intensify. Amazon and Azure will lead the space, followed (with quite some distance) by IBM. Google and Oracle will stay far behind the leading 2+1 Cloud providers. Both Microsoft and Amazon will see significant growth, with Microsoft’s growth being higher, meaning that Microsoft will continue to catch up with Amazon
  • More PaaS Solutions will arrive. All major vendors will provide PaaS solutions on their platform for different use-cases (e.g. Internet of Things). These Solutions will become more industry-specific (e.g. a Solution specific for manufacturing workflows, …)
  • Vendors currently not using the cloud will see declines in their income, as more and more companies move to the cloud
  • Cloud data centers will more often be outsourced by the leading providers to local companies, in order to comply with local legislation
  • Big Data in the Cloud will grow significantly in 2016, as more companies move workloads for these kinds of applications to the Cloud

What do you think? What are your predictions?

Big Data Europe Meetup in Vienna, 15th of December


On the 15th of December, a Big Data Meetup will take place in Vienna, with leading figures from Fraunhofer, RapidMiner, Teradata and others.

About the Meetup:

The growing digitization and networking process within our society has a large influence on all aspects of everyday life. Large amounts of data are being produced permanently, and when these are analyzed and interlinked they have the potential to create new knowledge and intelligent solutions for economy and society. Big Data can make important contributions to the technical progress in our societal key sectors and help shape business. What is needed are innovative technologies, strategies and competencies for the beneficial use of Big Data to address societal needs.

Climate, energy, food, health, transport, security, and social sciences are the most important societal challenges tackled by the European Union within the new research and innovation framework programme “Horizon 2020”. In every one of these fields, the processing, analysis and integration of large amounts of data plays a growing role, such as the analysis of medical data, the decentralized supply of renewable energy or the optimization of traffic flow in large cities.

Big Data Europe (BDE, http://www.big-data-europe.eu) will undertake the foundational work for enabling European companies to build innovative multilingual products and services based on semantically interoperable, large-scale, multi-lingual data assets and knowledge, available under a variety of licenses and business models.

On 14-15 December 2015 the whole BDE team is meeting in Vienna for a project plenary, and around 35 experts on the topic will therefore participate in the Big Data Europe MeetUp on 15 December 2015 at the Impact Hub Vienna, to discuss challenges, requirements and proven solutions for big data management together with the audience.

Agenda
16:00 – 16:10, Welcome & the BDE MeetUp, Vienna – Martin Kaltenböck (SWC)
16:10 – 16:30, The Big Data Europe Project
Sören Auer (Fraunhofer IAIS, BDE Project Lead)
16:30 – 16:45, Big Data Management Models (e.g. RACE)
Mario Meir-Huber (Big Data Lead CEE, Teradata, Vienna – Austria)
16:45 – 17:00, Selected Big Data Projects in Budapest & beyond
Zoltan C Toth (Senior Big Data Engineer, RapidMiner Inc., Budapest – Hungary)
17:00 – 17:30 Open Discussion with the Panel on Big Data Requirements, Challenges and Solutions.
17:30 – 19:00 Networking & Drinks
Remark: 19:00/30 end of event…

Register here or here.

Conference announcement – Data Natives in Berlin


I am happy to announce that there is a partnership between the Data Natives conference and Cloudvane. Once again, one lucky person can get a free ticket to this conference. The conference takes place from 19th to 20th November in Berlin.

What’s necessary for you to get the ticket:

  • Share the blog post (Twitter, LinkedIn, Facebook) and send the proof of that to me via mail
  • Write a review (ideally with some pictures)

Data Natives focuses on three key areas of innovation: Big Data, IoT and FinTech. The intersection of these product categories is home to the most exciting technology innovation happening today. Whether it’s for individual consumers or multi-billion dollar industries, the opportunity is immense. Come and learn more from leading scientists, founders, analysts, investors and economists coming from Google, SAP, Rocket Internet, Gartner, Forrester, among others. Two days full of interesting talks, sharing knowledge from 50+ speakers and engaging with a community of more than 500 people from a data-driven generation.

More information on www.datanatives.io 

Thursday, November 19, 8:30AM to Friday, November 20, 7:00PM

NHow Hotel Berlin

Stralauer Allee 3

10245 Berlin

Germany

What everyone is doing wrong about Big Data


I have seen so many Big Data “initiatives” in companies over the last months. And guess what? Most of them either failed completely or simply didn’t deliver the expected results. A recent Gartner study even mentioned that only 20% of Hadoop projects are put “live”. But why do these projects fail? What is everyone doing wrong?

Whenever customers come to me, they have “heard” what Big Data can help them with. So they looked at 1-3 use cases and now want to put them into production. However, this is where the problem starts: they are not aware that Big Data, too, needs a strategic approach. To get this right, it is necessary to understand the industry (e.g. TelCo, Banking, …) and the associated opportunities. To achieve that, a Big Data roadmap has to be built, normally in a couple of workshops with the business. This roadmap then outlines which projects are done in which priority and how results are measured. For this, we have a Business Value Framework for different industries, in which possible projects are defined.

The other thing I often see is customers who come and say: so now we have built a data lake. What should we do with it? We simply can’t find value in our data. This is a totally wrong approach. We often talk about the data lake, but it is not as easy as IT marketing tells us: whenever you build a data lake, you first have to think about what you want to do with it. Why should you know what you might find if you don’t really know what you are looking for? Ever tried searching for “something”? Without a strategy, the lake is worth nothing and you will find nothing. A data lake therefore makes sense, but you need to know what you want to build on top of it. Building a data lake for Big Data is like buying bricks for a house without knowing where you are going to build that house and what it should finally look like. Still, a data lake is necessary to provide great analytics and to run projects on top of it.

Big Data and IT Business alignment

Summing it up: what is necessary for Big Data is to have a clear strategy and vision in place. If you fail to do so, you will end up like many others, desperate about the promises that didn’t turn out to be true.