International Data Science Conference, Salzburg


Hi,

I am happy to share this exciting conference I am keynoting at. Mike Olson from Cloudera will also deliver a keynote at the conference.

About the conference:

June 12th – 13th 2017 | Salzburg, Austria | www.idsc.at

The 1st International Data Science Conference (iDSC 2017) organized by Salzburg University of Applied Sciences (Information Technology and Systems Management) in cooperation with Information Professionals GmbH seeks to establish a key Data Science event, providing a forum for an international exchange on Data Science technologies and applications.

The International Data Science Conference gives participants the opportunity, over the course of two days, to delve into the most current research and up-to-date practice in Data Science and data-driven business. In addition to the two parallel tracks, the Research Track and the Industry Track, a Symposium will take place on the second day, presenting the outcomes of a European project on Text and Data Mining (TDM). These events are open to all participants.

We are also proud to announce keynote presentations from Mike Olson (Chief Strategy Officer, Cloudera), Ralf Klinkenberg (General Manager, RapidMiner), Euro Beinat (Data Science Professor and Managing Director, CS Research) and Mario Meir-Huber (Big Data Architect, Microsoft). The keynotes will be distributed over both conference days, providing time for all participants to come together and share views on challenges and trends in Data Science.

The Research Track offers a series of short presentations from Data Science researchers on their current papers. On both conference days, we are planning a morning and an afternoon session presenting the results of innovative research into data mining, machine learning, data management and the entire spectrum of Data Science.

The Industry Track showcases real practitioners of data-driven business and how they use Data Science to help achieve organizational goals. Though not restricted to these topics, the industry talks will concentrate on our broad focus areas of manufacturing, retail and social good. Users of data technologies can meet with peers and exchange ideas and solutions to the practical challenges of data-driven business.

Furthermore, the Symposium is organized in collaboration with the FutureTDM Consortium. FutureTDM is a European project which, over the last two years, has been identifying the legal and technical barriers, as well as the skills gaps among stakeholders and practitioners, that inhibit the uptake of text and data mining by researchers and innovative businesses. The Symposium will focus on the recommendations and guidelines proposed to counterbalance these barriers, ensure broader TDM uptake and thus boost Europe’s research and innovation capacities.

Our sponsors, including Cloudera, will have their own special platform: half-day workshops to provide hands-on interaction with tools or to learn approaches to developing concrete solutions. In addition, there will be an exhibition of the sponsors’ products and services throughout the conference, giving participants the opportunity to seek contact and advice.

The iDSC 2017 is therefore a unique meeting place for researchers, business managers, and data scientists to discover novel approaches and to share solutions to the challenges of a data-driven world.

Hadoop Tutorial – Data Science with Apache Mahout


Apache Mahout is the Hadoop service that covers what is often called “data science”. Mahout is all about learning algorithms, pattern recognition and the like. An interesting fact about Mahout is that, under the hood, MapReduce has been replaced by Spark as the execution engine.

Mahout is in charge of the following tasks:

  • Machine Learning. Learning from existing data and applying what was learned to new data.
  • Recommendation Mining. This is what we often see on websites. Remember the “You bought X, you might be interested in Y”? This is exactly what Mahout can do for you.
  • Clustering. Mahout can cluster documents and data that have some similarities.
  • Classification. Learning from existing classifications in order to classify new data.

A Mahout program is written in Java. The next listing shows how a recommender is built and evaluated using a recommender builder.

// Load the preference data from a file
DataModel model = new FileDataModel(new File("/home/var/mydata.xml"));

// Evaluator that measures the average absolute difference between
// predicted and actual preferences
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

// MyRecommenderBuilder is a user-defined class implementing RecommenderBuilder
RecommenderBuilder builder = new MyRecommenderBuilder();

// Train on 90% of the data and evaluate on the rest, using all of the data model
double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);

System.out.println(result);

A Mahout program
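
The listing above relies on MyRecommenderBuilder, a user-defined class. As a rough sketch of what such a builder could look like (an illustrative example using user-based collaborative filtering; the neighborhood size of 10 is an assumption, not taken from the original listing):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Illustrative builder: user-based collaborative filtering with
// Pearson correlation and a 10-user neighborhood (assumed values).
public class MyRecommenderBuilder implements RecommenderBuilder {
    @Override
    public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
    }
}

The evaluator calls buildRecommender() on the training portion of the data and compares the recommender’s estimates against the held-out preferences.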

Big Data in Manufacturing


Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.

Today’s focus: Big Data in Manufacturing.

Manufacturing is a traditional industry relevant to almost every country in the world. It emerged in the industrial revolution, when machines took over and production became more and more automated. Big Data has the potential to substantially change the manufacturing industry once again, offering various opportunities.

Manufacturers can utilize Big Data for various reasons. First, it is all about quality. When we look at production chains, be it building a car or just some metal works, quality is key. Who wants to buy a car that is broken? Exactly, nobody. Improving quality is therefore a key aspect of Big Data for manufacturers, and it comes with several aspects. First of all, it is necessary to collect data about the production line(s) and all devices that are connected or connectable. When errors occur or a product isn’t as desired, the production data can be analyzed and reviewed; this is where data scientists come in. Real-time analytics allow the company to improve material and product quality, for example by analyzing images of products or materials and removing items from the production line if they don’t meet certain standards.

A key challenge in manufacturing today is the high degree of product customization. When buying a new car, Henry Ford’s words (you can have the Model T in any color as long as it is black) are no longer true. Whatever type of product customers order, they expect it to reflect their own personality, and a company that fails to deliver that risks losing customers. But what is the connection to Big Data? This customization is a strong shift towards Industry 4.0, which is heavily promoted by the German industry. In order to make products customizable, it is necessary to have an automated production line and to know what customers might want, for example by analyzing recent sales and trends from social networks and the like.

Changing the output of a production line is often difficult and inefficient. Big Data analytics allow manufacturers to better understand future demand and reduce production peaks. This enables the manufacturer to plan better, act faster in the market and become more efficient.

Big Data 101: Partitioning


Partitioning is another factor for Big Data applications. It is one of the factors of the CAP theorem (see 1.6.1) and is also important for scaling applications. Partitioning basically describes the ability to distribute a database over different servers. In Big Data applications, it is often not possible to store everything on one server (Josuttis, 2011).

Figure: Data Partitioning

The partitioning factors illustrated in the figure are described by (Rys, 2011). Functional partitioning basically describes the service-oriented architecture (SOA) approach (Josuttis, 2011). With SOA, different functions are provided by their own services. If we look at a web shop such as Amazon, there are a lot of different services involved: some services handle the order workflow, other services handle the search, and so on. If there is high load on a specific service such as the shopping cart, new instances can be added on demand. This reduces the risk of an outage that would lead to losing money. Building a service-oriented architecture alone doesn’t solve all partitioning problems, so data also has to be partitioned. With data partitioning, the data is distributed over different servers; it can also be distributed geographically. A partition key identifies which partition a piece of data belongs to. Since there is a lot of data and single nodes may fail, data should also be replicated and stored redundantly in order to deal with node failures.
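
As a rough illustration of key-based data partitioning with replication (a simplified sketch; the modulo-based placement, the eight servers and the replication factor of three are assumptions for illustration, not a production-grade scheme such as consistent hashing):

import java.util.ArrayList;
import java.util.List;

// Simplified sketch: place each record on a partition derived from its
// partition key, and replicate it to the following servers so that the
// data survives single node failures.
public class KeyPartitioner {
    private final int numberOfServers;
    private final int replicationFactor;

    public KeyPartitioner(int numberOfServers, int replicationFactor) {
        this.numberOfServers = numberOfServers;
        this.replicationFactor = replicationFactor;
    }

    // Primary partition for a given partition key (e.g. a customer id).
    public int primaryPartition(String partitionKey) {
        return Math.floorMod(partitionKey.hashCode(), numberOfServers);
    }

    // Primary plus replicas: the next servers in a ring-like layout.
    public List<Integer> partitionsFor(String partitionKey) {
        List<Integer> partitions = new ArrayList<>();
        int primary = primaryPartition(partitionKey);
        for (int i = 0; i < replicationFactor; i++) {
            partitions.add((primary + i) % numberOfServers);
        }
        return partitions;
    }

    public static void main(String[] args) {
        KeyPartitioner partitioner = new KeyPartitioner(8, 3);
        // The record for "customer-42" lives on these three servers.
        System.out.println(partitioner.partitionsFor("customer-42"));
    }
}

In practice, systems use schemes such as consistent hashing so that adding or removing a server does not force most keys to move.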

Big Data 101: Scalability


Scalability is another factor of Big Data applications described by (Rys, 2011). Whenever we talk about Big Data, we mainly talk about highly scalable systems. Each Big Data application should be built in a way that eases scaling. (Rys, 2011) describes several needs for scaling: user load scalability, data load scalability, computational scalability and scale agility.

Figure: Data Scalability

The figure illustrates the different needs for scalability in Big Data environments as described by (Rys, 2011). Many applications such as Facebook (Fowler, 2012) have a lot of users. Applications should support a large user base and should remain resilient in case they see unexpectedly high user numbers. Various techniques can be applied to support different needs such as fast data access. A factor that often, but not only, comes with a high number of users is the data load. (Rys, 2011) describes that some or many users can produce this data; however, sensors and other devices that do not directly relate to users can also produce large datasets. Computational scalability is the ability to scale the processing of large datasets: data is frequently analyzed, and this requires compute power on the analysis side. Distributed algorithms such as MapReduce require a lot of nodes in order to run queries and analyses with acceptable performance. Scale agility describes the possibility to change the environment of a system; this basically means that new instances, such as compute nodes, can be added or removed on demand. This requires a high level of automation and virtualization and is very similar to what can be done in cloud computing environments. Platforms such as Amazon EC2, Windows Azure, OpenStack, Eucalyptus and others enable this level of self-service, which greatly supports scale agility for Big Data environments.
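
To make computational scalability more concrete, here is a minimal word-count sketch against the Hadoop MapReduce API (an illustrative example, not part of the original text); the same mapper and reducer code runs unchanged whether the cluster has three nodes or three hundred, because the framework splits the input and distributes the work:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in the input split assigned to this node.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts for each word; many reducers run in parallel across the cluster.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}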

Big Data 101: Data agility


Agility is an important factor for Big Data applications. (Rys, 2011) describes three different agility factors: model agility, operational agility and programming agility.

Figure: Data agility

Model agility describes how easy it is to change the data model. Traditionally, in SQL systems it is rather hard to change a schema. Other systems, such as non-relational databases, allow the data model to be changed easily. If we look at key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases in fast-changing systems such as social media applications and online shops require model agility, since updates to such systems occur frequently, often weekly to daily (Paul, 2012).
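
A small sketch of the difference (illustrative only; the users table, the nickname attribute and the in-memory map standing in for a key/value store are assumptions): in a relational system a new attribute requires a schema migration, while in a key/value store each item simply carries the attributes it needs.

import java.util.HashMap;
import java.util.Map;

public class ModelAgilityExample {
    public static void main(String[] args) {
        // Relational world (shown only as a comment): every row must match the
        // schema, so a new attribute requires a migration such as
        //   ALTER TABLE users ADD COLUMN nickname VARCHAR(255);

        // Key/value world: each item is a bag of attributes, so newer items can
        // carry fields that older items never had; no schema change is needed.
        Map<String, Map<String, Object>> users = new HashMap<>();

        Map<String, Object> oldUser = new HashMap<>();
        oldUser.put("id", "u1");
        oldUser.put("name", "Alice");
        users.put("u1", oldUser);

        Map<String, Object> newUser = new HashMap<>();
        newUser.put("id", "u2");
        newUser.put("name", "Bob");
        newUser.put("nickname", "bobby"); // new attribute, only on this item
        users.put("u2", newUser);

        System.out.println(users);
    }
}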

In distributed environments, it is often necessary to change operational aspects of a system. New servers are added frequently, often with different characteristics such as operating system and hardware. Database systems should stay tolerant to such operational changes, as this is a crucial factor for growth.

Database systems should also support software developers; this is where programming agility comes into play. Programming agility means that the database and all associated SDKs should ease the life of the developer working with the database, and should also support fast development.

Are you a Data Scientist or what is necessary to become one?


Data Scientist is considered to be the job you simply have to go for. Some call it sexy, some call it the best job of the future. But what exactly is a Data Scientist? Is it someone you can simply hire straight from university, or is it more complicated? Definitely the latter.

When we think about a Data Scientist, we often say that the perfect Data Scientist is a kind of hybrid between a statistician and a computer scientist. I think this needs to be redefined, since much more knowledge is necessary. A Data Scientist should also be good at analysing business cases and at talking to line executives to understand the problem and model an ideal solution. Furthermore, extensive knowledge of current (international) law is necessary. In a recent study we did, we defined five major challenges:

Figure: The perfect Data Scientist

Each of the five topics is about the following:

  • Big Data Business Developer: This person needs to know what questions to ask, how to cooperate with line-of-business (LOB) decision makers, and must have the social skills to work with all of them.
  • Big Data Technologist: In case your company isn’t using the cloud for Big Data analytics, you also need to be into infrastructure. This person must know a lot about system infrastructure, distributed systems, datacenter design and operating systems. Furthermore, it is also important to know how to run your software: Hadoop doesn’t install itself, and some maintenance is necessary.
  • Big Data Analyst: This is the fun part; here it is all about writing your queries, running Hadoop jobs, doing fancy MapReduce queries and so on! The person should know what to analyse and how to implement such algorithms. This also covers machine learning and more advanced topics.
  • Big Data Developer: Here it is more about writing extensions, add-ons and other tooling. It is also about distributed programming, which isn’t the easiest part in itself.
  • Big Data Artist: Got the hardware/datacenter right? Know what to analyse? Wrote the algorithms? What about presenting the results to your management? Exactly! This is also necessary, and you simply shouldn’t forget about it. The best data is worth nothing if nobody is interested in it because of poor presentation, so it is also necessary to know how to present your data.

As you can see, it is very hard to become a data scientist. Things are not as easy as they might seem. The Data Scientist should be a nerd in each of these fields, so the person would be some kind of a “super nerd”. This might be the superhero of the future.

Most likely, you won’t find one person that is good in all of these fields. Therefore, it is necessary to build an effective team.

Header Image Copyright: Chase Elliott Clark