Machine Learning 101 – Clustering, Regression and Classification


In my last post of this series, I explained the concept of supervised, unsupervised and semi-supervised machine learning. In this post, we will go a bit deeper into machine learning (but don’t worry, it won’t be that deep yet!) and look at more concrete topics. But first of all, we have to define some terms, which basically derive from statistics or mathematics. These are:

  • Features
  • Labels

Features are known values that are often used to calculate results. These are the variables that have an impact on a prediction. If we talk about manufacturing, we might want to reduce junk in our production line. Known features of a machine could then be: temperature, humidity, operator, time since last service. Based on these features, we can later calculate the quality of the machine output.

Labels are the values we want to predict. In training data, the labels are usually known, but at prediction time they are not. If we stay with the machine data example from above, the label would be the quality. All of the features together determine a good or bad quality, and algorithms can then calculate the quality based on them.
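
To make this a bit more tangible, here is a minimal sketch of how one such machine reading could be represented in code as features plus a label. The class and field names are made up for this post and only illustrate the idea:

// Illustrative only: one machine reading, represented as features plus a label.
public class MachineReading {

    // Features: the known input values that influence the prediction
    double temperature;        // e.g. 74.5 (degrees Celsius)
    double humidity;           // e.g. 0.43 (relative humidity)
    String operator;           // e.g. "shift-b"
    double hoursSinceService;  // e.g. 120.0

    // Label: the value we want to predict for new readings;
    // it is known in the training data, but unknown at prediction time
    String quality;            // e.g. "good" or "bad"
}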

Let’s now move on to another “classification” of machine learning techniques. We “cluster” them by whether they are supervised or unsupervised.

The first one is clustering. Clustering is an unsupervised technique. With clustering, the algorithm tries to find patterns in data sets that have no labels associated with them. An example would be clustering customers by their buying behaviour: features for this could be household income, age, … and clusters of different consumer types could then be built.
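
To illustrate, here is a minimal k-means sketch in plain Java that groups customers into two clusters by household income and age. The customer numbers are made up, and a real project would scale the features first (income dominates the distance otherwise) and use a library implementation:

import java.util.Arrays;

// Minimal k-means sketch: cluster customers described by {household income, age}.
public class CustomerClustering {
    public static void main(String[] args) {
        double[][] customers = { {32000, 24}, {35000, 27}, {78000, 52}, {81000, 49}, {54000, 38} };
        int k = 2;
        double[][] centers = { customers[0].clone(), customers[2].clone() }; // naive initialisation
        int[] assignment = new int[customers.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: each customer goes to the cluster with the nearest center
            for (int i = 0; i < customers.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist(customers[i], centers[c]) < dist(customers[i], centers[best])) best = c;
                }
                assignment[i] = best;
            }
            // Update step: move each center to the mean of its assigned customers
            for (int c = 0; c < k; c++) {
                double sumIncome = 0, sumAge = 0;
                int n = 0;
                for (int i = 0; i < customers.length; i++) {
                    if (assignment[i] == c) { sumIncome += customers[i][0]; sumAge += customers[i][1]; n++; }
                }
                if (n > 0) { centers[c][0] = sumIncome / n; centers[c][1] = sumAge / n; }
            }
        }
        System.out.println(Arrays.toString(assignment)); // cluster index per customer
    }

    static double dist(double[] a, double[] b) {
        return Math.sqrt((a[0] - b[0]) * (a[0] - b[0]) + (a[1] - b[1]) * (a[1] - b[1]));
    }
}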

The next one is classification. In contrast to clustering, classification is a supervised technique. Classification algorithms look at existing data and predict which class a new data point belongs to. Classification has been used for spam filtering for years now, and these algorithms are more or less mature at classifying something as spam or not. With machine data, it could be used to predict material quality from several known parameters (e.g. humidity, strength, color, …). The output of the material prediction would then be the quality type (either “good” or “bad”, or a number in a defined range such as 1-10). Another well-known example is predicting whether someone would have survived the Titanic: the classification output is “true” or “false” and the input features are “age”, “sex” and “class”. If you were 55, male and in 3rd class, chances were low, but if you were 12, female and in first class, chances were rather high.
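
As a rough illustration of the Titanic example, here is a tiny nearest-neighbour classifier. The passengers and the numeric encoding (sex as 0/1, class as 1-3) are made up, and a real classifier would use far more data and properly scaled features:

// Tiny 1-nearest-neighbour sketch for the Titanic example.
// Feature order: {age, sex (0 = male, 1 = female), class (1-3)}; label: survived?
public class TitanicClassifier {
    static double[][] trainFeatures = {
        {55, 0, 3}, {40, 0, 3}, {30, 0, 2}, {12, 1, 1}, {25, 1, 1}, {35, 1, 2}
    };
    static boolean[] trainLabels = { false, false, false, true, true, true };

    public static void main(String[] args) {
        System.out.println(predict(new double[] {55, 0, 3})); // prints false
        System.out.println(predict(new double[] {12, 1, 1})); // prints true
    }

    // Predict the label of the closest known passenger
    static boolean predict(double[] passenger) {
        int best = 0;
        for (int i = 1; i < trainFeatures.length; i++) {
            if (dist(trainFeatures[i], passenger) < dist(trainFeatures[best], passenger)) best = i;
        }
        return trainLabels[best];
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}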

The last technique for this post is regression. Regression is often confused with classification, but it is still different from it. With regression, no discrete labels (such as good or bad, spam or not spam, …) are predicted. Instead, regression outputs continuous, often unbounded, numbers. This makes it useful for financial predictions and the like. A commonly known example is the prediction of housing prices, where several values (FEATURES!) are known, such as the distance to specific landmarks, the plot size, … The algorithm can then predict a price for your house and the amount you could sell it for.
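
Here is a minimal regression sketch: an ordinary least-squares fit of the price against a single feature (the plot size). All numbers are invented for illustration:

// Minimal ordinary least-squares regression: price = slope * plotSize + intercept
public class HousePriceRegression {
    public static void main(String[] args) {
        double[] plotSize = { 300, 450, 600, 800, 1000 };                // feature (m^2)
        double[] price    = { 150000, 210000, 260000, 340000, 410000 }; // label (EUR)

        double meanX = mean(plotSize), meanY = mean(price);
        double num = 0, den = 0;
        for (int i = 0; i < plotSize.length; i++) {
            num += (plotSize[i] - meanX) * (price[i] - meanY);
            den += (plotSize[i] - meanX) * (plotSize[i] - meanX);
        }
        double slope = num / den;
        double intercept = meanY - slope * meanX;

        // The output is a continuous number, not a class
        System.out.println("Predicted price for 700 m^2: " + (slope * 700 + intercept));
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }
}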

In my next post, I will talk about different algorithms that can be used for such problems.


International Data Science Conference, Salzburg


Hi,

I am happy to share this exciting conference I am keynoting at. Also, Mike Olson from Cloudera will deliver a keynote at the conference.

About the conference:

June 12th – 13th 2017 | Salzburg, Austria | www.idsc.at

The 1st International Data Science Conference (iDSC 2017) organized by Salzburg University of Applied Sciences (Information Technology and Systems Management) in cooperation with Information Professionals GmbH seeks to establish a key Data Science event, providing a forum for an international exchange on Data Science technologies and applications.

The International Data Science Conference gives the participants the opportunity, over the course of two days, to delve into the most current research and up-to-date practice in Data Science and data-driven business. Besides the two parallel tracks, the Research Track and the Industry Track, on the second day a Symposium is taking place presenting the outcomes of a European Project on Text and Data Mining (TDM). These events are open to all participants.

Also, we are proud to announce keynote presentations from Mike Olson (Chief Strategy Officer, Cloudera), Ralf Klinkenberg (General Manager, RapidMiner), Euro Beinat (Data Science Professor and Managing Director, CS Research) and Mario Meir-Huber (Big Data Architect, Microsoft). These keynotes will be distributed over both conference days, providing time for all participants to come together and share views on challenges and trends in Data Science.

The Research Track offers a series of short presentations from Data Science researchers on their own, current papers. On both conference days, we are planning a morning and an afternoon session presenting the results of innovative research into data mining, machine learning, data management and the entire spectrum of Data Science.

The Industry Track showcases real practitioners of data-driven business and how they use Data Science to help achieve organizational goals. Though not restricted to these topics only, the industry talks will concentrate on our broad focus areas of manufacturing, retail and social good. Users of data technologies can meet with peers and exchange ideas and solutions to the practical challenges of data-driven business.

Furthermore, the Symposium is organized in collaboration with the FutureTDM Consortium. FutureTDM is a European project which over the last two years has been identifying the legal and technical barriers, as well as the skills stakeholders/practitioners lack, that inhibit the uptake of text and data mining for researchers and innovative businesses. The recommendations and guidelines recognized and proposed to counterbalance these barriers, so as to ensure broader TDM uptake and thus boost Europe’s research and innovation capacities, will be the focus of the Symposium.

Our sponsors Cloudera, F&F and um etc. will have their own special platform: half-day workshops to provide hands-on interaction with tools or to learn approaches to developing concrete solutions. In addition, there will be an exhibition of the sponsors’ products and services throughout the conference, with the opportunity for the participants to seek contact and advice.

The iDSC 2017 is therefore a unique meeting place for researchers, business managers, and data scientists to discover novel approaches and to share solutions to the challenges of a data-driven world.

Hadoop Tutorial – Data Science with Apache Mahout


Apache Mahout is the service on Hadoop that is in charge of what is often called “data science”. Mahout is all about learning algorithms, pattern recognition and the like. An interesting fact about Mahout is that, under the hood, MapReduce has been replaced by Spark.

Mahout is in charge of the following tasks:

  • Machine Learning. Learning from existing data in order to make predictions on new data.
  • Recommendation Mining. This is what we often see at websites. Remember the “You bought X, you might be interested in Y”? This is exactly what Mahout can do for you.
  • Clustering. Mahout can cluster documents and data that have some similarities.
  • Classification. Learn from existing classifications.

A Mahout program is written in Java. The next listing shows how a recommender is built and evaluated.

import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

// Load the preference data from a file
DataModel model = new FileDataModel(new File("/home/var/mydata.xml"));

// Evaluator that scores a recommender by its average absolute error
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

// MyRecommenderBuilder is our own implementation of the RecommenderBuilder interface
RecommenderBuilder builder = new MyRecommenderBuilder();

// Train on 90% of each user's data and evaluate on the rest, using all users
double result = eval.evaluate(builder, null, model, 0.9, 1.0);

System.out.println(result);

A Mahout program

Big Data in Manufacturing


Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.

Today’s focus: Big Data in Manufacturing.

Manufacturing is a traditional industry relevant to almost any country in the world. It started to emerge in the industrial revolution, when machines took over and production became more and more automated. Big Data has the potential to substantially change the manufacturing industry again – with various opportunities.

Manufacturers can utilize Big Data for various reasons. First, it is all about quality. When we look at production chains, be it producing a car or just some metal works, quality is key. Who wants to buy a car that is broken? Exactly, nobody. Improving quality is therefore a key aspect of Big Data for manufacturers, and it comes with several aspects. First of all, it is necessary to collect data about the production line(s) and all devices that are connected or connectable. When errors occur or a product isn’t as desired, the production data can be analyzed and reviewed; data scientists basically do a great job on that. Real-time analytics then allow the company to improve material and product quality further. This can be done by analyzing images of products or materials and removing them from the production line in case they don’t fulfill certain standards.

A key challenge in manufacturing today is the high degree of product customization. When buying a new car, Henry Ford’s words (you can have the Model T in any colour as long as it is black) are not true any more. Whatever type of product customers order, they expect that their own personality is reflected by the product. If a company fails to deliver that, it risks losing customers. But what is the connection to Big Data? Well, this customization is a strong shift towards Industry 4.0, which is heavily promoted by the German industry. In order to make products customizable, it is necessary to have an automated production line and to know what customers might want – by analyzing recent sales and trends from social networks and the like.

Changing the output of a production line is often difficult and ineffective. Big Data analytics allow manufacturers to better understand future demand so they can reduce production peaks. This enables the manufacturer to better plan and act in the market – and become more efficient.

Big Data 101: Partitioning


Partitioning is another factor for Big Data applications. It is one of the factors of the CAP theorem and is also important for scaling applications. Partitioning basically describes the ability to distribute a database over different servers. In Big Data applications, it is often not possible to store everything on one server (Josuttis, 2011).

Figure: Data Partitioning

The factors for partitioning illustrated in the figure above are described by (Rys, 2011). Functional partitioning basically describes the service-oriented architecture (SOA) approach (Josuttis, 2011). With SOA, different functions are provided by their own services. If we talk about a web shop such as Amazon, there are a lot of different services involved: some services handle the order workflow, other services handle the search, and so on. If there is high load on a specific service such as the shopping cart, new instances can be added on demand. This reduces the risk of an outage that would lead to losing money. Building a service-oriented architecture alone doesn’t solve all partitioning problems, though. Therefore, data also has to be partitioned. With data partitioning, the data is distributed over different servers; it can also be distributed geographically. A partition key identifies which partition a piece of data belongs to. Since there is a lot of data and single nodes may fail, it is also necessary to replicate data in the network: data should be stored redundantly in order to deal with node failures.
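
To illustrate the role of the partition key, here is a minimal sketch (the server names are made up) that hashes a key to a primary server and stores a redundant copy on the next one:

// Minimal sketch of key-based data partitioning with simple replication.
public class PartitioningSketch {
    static String[] servers = { "node-1", "node-2", "node-3", "node-4" };

    public static void main(String[] args) {
        String partitionKey = "customer-4711";
        int primary = partition(partitionKey);
        int replica = (primary + 1) % servers.length; // redundant copy on the next node

        System.out.println("Primary: " + servers[primary] + ", replica: " + servers[replica]);
    }

    // Map a partition key to a server: hash modulo the number of servers
    static int partition(String key) {
        return Math.floorMod(key.hashCode(), servers.length);
    }
}

Real systems typically use consistent hashing instead of a plain modulo, so that adding or removing a node does not reshuffle most of the data.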

Big Data 101: Scalability


Scalability is another factor of Big Data applications described by (Rys, 2011). Whenever we talk about Big Data, it mainly involves highly scalable systems. Each Big Data application should be built in a way that eases scaling. (Rys, 2011) describes several needs for scaling: user load scalability, data load scalability, computational scalability and scale agility.

Figure: Data Scalability

The figure illustrates the different needs for scalability in Big Data environments as described by (Rys, 2011). Many applications, such as Facebook (Fowler, 2012), have a large number of users. Applications should support this large user base and should remain robust in case the application sees unexpectedly high user numbers. Various techniques can be applied to support different needs such as fast data access. A factor that often – but not only – comes with a high number of users is the data load. (Rys, 2011) describes that some or many users can produce this data. However, things such as sensors and other devices that are not directly related to users can also produce large datasets. Computational scalability is the ability to scale the computation to large datasets. Data is often analyzed, and this needs compute power on the analysis side. Distributed algorithms such as Map/Reduce require a lot of nodes in order to run queries and analyses with good performance. Scale agility describes the possibility to change the environment of a system. This basically means that new instances, such as compute nodes, can be added or removed on demand. This requires a high level of automation and virtualization and is very similar to what can be done in cloud computing environments. Several platforms such as Amazon EC2, Windows Azure, OpenStack, Eucalyptus and others enable this level of self-service, which greatly supports scale agility in Big Data environments.
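
To give an idea of the Map/Reduce model mentioned above, here is a minimal single-process word-count sketch. On a real cluster, the map and reduce steps would run in parallel on many nodes over partitioned data:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Single-process sketch of the Map/Reduce idea:
// the map step emits one word per occurrence, the reduce step sums the occurrences per word.
public class WordCountSketch {
    public static void main(String[] args) {
        List<String> documents = List.of("big data needs scale", "data data data");

        Map<String, Long> counts = documents.stream()
                .flatMap(doc -> Arrays.stream(doc.split("\\s+")))                      // map step
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));  // reduce step

        counts.forEach((word, count) -> System.out.println(word + ": " + count));
    }
}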

Big Data 101: Data agility


Agility is an important factor for Big Data applications. (Rys, 2011) describes three different agility factors, which are: model agility, operational agility and programming agility.

Figure: Data agility

Model agility describes how easy it is to change the data model. Traditionally, in SQL systems it is rather hard to change a schema. Other systems, such as non-relational databases, allow easy changes to the database. If we look at key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases in fast-changing systems such as social media applications, online shops and others require model agility. Updates to such systems occur frequently, often weekly to daily (Paul, 2012).
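
As an illustration of model agility (not tied to any specific product), here is a sketch of a schema-less item: a new attribute can be added without an ALTER TABLE or a migration of existing records:

import java.util.HashMap;
import java.util.Map;

// Sketch: in a key/value store, an item is just a set of attributes.
public class ModelAgilitySketch {
    public static void main(String[] args) {
        Map<String, Object> customer = new HashMap<>();
        customer.put("id", "4711");
        customer.put("name", "Jane Doe");

        // A later release starts tracking a loyalty level: new items simply carry
        // the extra attribute, existing items stay untouched.
        customer.put("loyaltyLevel", "gold");

        // In a relational schema the same change would typically require:
        //   ALTER TABLE customer ADD COLUMN loyalty_level VARCHAR(20);
        System.out.println(customer);
    }
}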

In distributed environments, it is often necessary to change operational aspects of a system. New servers get added often, frequently with different operating systems and hardware. Database systems should stay tolerant to such operational changes, as this is a crucial factor for growth.

Database systems should also support the software developers. This is where programming agility comes into play. Programming agility describes the approach that the database and all associated SDKs should ease the life of a developer working with the database. Furthermore, they should also support fast development.