How to create Data analytics projects

Big Data analysis is an iterative process. A famous fictional example of analyzing a lot of data comes from the novel “The Hitchhiker’s Guide to the Galaxy”. In the novel, a supercomputer is asked for the “Answer to the Ultimate Question of Life, the Universe, and Everything”. Because this is a very difficult problem to solve, it requires iteration.

Bizer, Boncz, Brodie, and Erling (2011) describe iteration steps for creating Big Data analysis applications. The paper names five steps: Define, Search, Transform, Entity Resolution and Answer the Query.

Define deals with the problem that needs to be solved. This is when the marketing manager says: “We need to find a location in county ‘xy’ where customers are over 18 and below 30 years old and where we have no store yet.” In the “Hitchhiker’s Guide to the Galaxy” example, this would be the question about the answer to everything.

Next, we identify candidate elements in the Big Data space. Bizer, Boncz, Brodie, and Erling (2011) call this “search”. In the marketing example, this means that we have to scan the data of all customers who are between 18 and 30 years old. This data must then be combined with store locations. In the “Hitchhiker’s Guide to the Galaxy” example, this means that we have to scan all data, since we are trying to find the answer to everything.
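The search step for the marketing example can be sketched in a few lines of Python. This is only an illustration: the `Customer` fields, the sample records and the function name `search_candidates` are hypothetical, and the age bounds are treated as exclusive ("over 18 and below 30"), which is an assumption about how the manager's question should be read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # Hypothetical record layout, not taken from the paper.
    name: str
    age: int
    location: str


def search_candidates(customers, min_age=18, max_age=30):
    """Return the customers whose age is over min_age and below max_age."""
    return [c for c in customers if min_age < c.age < max_age]


# Made-up sample data for illustration only.
customers = [
    Customer("Anna", 25, "Northtown"),
    Customer("Ben", 17, "Northtown"),
    Customer("Clara", 29, "Southville"),
    Customer("David", 45, "Southville"),
]

candidates = search_candidates(customers)
print([c.name for c in candidates])
```

On a real Big Data platform this filter would run close to where the data is stored rather than in a single Python process, but the idea of narrowing the data down to candidate elements stays the same.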

Transform means that the identified data has to be “extracted, transformed and loaded” into appropriate formats. This is part of the preparation phase, after which the data is almost ready for calculation. Data is extracted from different sources and transformed into a uniform format that can be used for the analysis. In the marketing example, we need to use sources from the government and combine them with our own customer data. Furthermore, we need map data. All of this data is then stored in our database for analysis. The “Hitchhiker’s” problem is more complicated: since we need to analyze ALL data available in the universe, we simply can’t copy it to a new system. The analysis has to be done on the systems where the data is stored.
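As a rough sketch of the transform step, the following Python snippet brings two differently shaped sources into one uniform record format. Both source formats (a semicolon-separated government row and an in-house customer record with `city` and `customer_age` fields) are invented for this example, as are the function names.

```python
def from_government(row):
    """Convert a hypothetical government row of the form 'LOCATION;AGE'."""
    location, age = row.split(";")
    return {"location": location.strip().title(), "age": int(age)}


def from_crm(record):
    """Convert a hypothetical in-house customer record with its own field names."""
    return {
        "location": record["city"].strip().title(),
        "age": int(record["customer_age"]),
    }


# Made-up sample data for illustration only.
government_rows = ["NORTHTOWN;25", "southville; 29"]
crm_records = [{"city": " northtown ", "customer_age": "22"}]

# Load everything into one list that uses the same keys, spelling and types.
unified = [from_government(r) for r in government_rows]
unified += [from_crm(r) for r in crm_records]
print(unified)
```

After this step every record has the same keys and types, regardless of where it came from, so the later steps do not need to know about the individual sources.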

After the data elements are prepared, we need to resolve entities. In this phase, we ensure that data entities are unique, relevant and comprehensive. In the marketing example, this means that all entities with an age of 18 to 30 are resolved. In the Hitchhiker’s problem, we can’t resolve entities: once again, we need to find the answer to everything and can’t afford to exclude data.
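One simple way to picture entity resolution is to merge records that describe the same customer. The sketch below assumes that a normalized e-mail address identifies a customer, which is a simplification chosen for this example; real entity resolution usually needs fuzzier matching. The field names and sample records are hypothetical.

```python
def resolve_entities(records):
    """Merge records that share the same normalized e-mail address."""
    resolved = {}
    for record in records:
        key = record["email"].strip().lower()
        merged = resolved.setdefault(key, {})
        for field, value in record.items():
            # Keep the first non-empty value seen for each field.
            if value not in (None, "") and field not in merged:
                merged[field] = value
        # Store the normalized address so duplicates look identical.
        merged["email"] = key
    return list(resolved.values())


# Made-up sample data: the first two records describe the same person.
records = [
    {"email": "Anna@Example.com", "age": 25, "location": ""},
    {"email": "anna@example.com ", "age": None, "location": "Northtown"},
    {"email": "ben@example.com", "age": 28, "location": "Southville"},
]

entities = resolve_entities(records)
print(entities)
```

The two partial records for Anna end up as one entity that carries both her age and her location, which covers the "unique" and "comprehensive" goals of this step in a small way.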

In the last step, the data is finally analyzed. Bizer, Boncz, Brodie, and Erling (2011) describe this as “answer the query”. Big Data analysis usually needs many nodes that compute the results from the available datasets. In our marketing example, we would take the resolved datasets and compare them with our store locations. The result would be a list of locations where no store is available yet and the age condition is fulfilled. In our Hitchhiker’s example, we would analyze all data and look for the ultimate answer.
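For the marketing example, answering the query could look roughly like the following Python sketch. It counts customers in the target age range per location and drops every location that already has a store. The record layout, the sample data and the name `answer_query` are assumptions made for illustration, and the computation runs in one process here instead of on many nodes.

```python
from collections import Counter


def answer_query(customers, store_locations, min_age=18, max_age=30):
    """Return (location, customer count) pairs for locations without a store."""
    counts = Counter(
        c["location"] for c in customers if min_age < c["age"] < max_age
    )
    result = [
        (location, n)
        for location, n in counts.items()
        if location not in store_locations
    ]
    # Locations with the most target customers come first; ties sort by name.
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample data for illustration only.
customers = [
    {"location": "Northtown", "age": 25},
    {"location": "Northtown", "age": 22},
    {"location": "Southville", "age": 29},
    {"location": "Eastfield", "age": 45},
    {"location": "Westbury", "age": 19},
]
store_locations = {"Southville"}

answer = answer_query(customers, store_locations)
print(answer)
```

Ranking by customer count is an addition of this sketch; it simply makes the list easier to act on when several locations qualify.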

The following figure shows the five steps for Big Data analysis as described above.

Data iteration

Published by

Mario Meir-Huber

I work as Big Data Architect for Microsoft. With this role, I support my customers in applying Big Data technologies - mainly Hadoop/Spark - for their use-cases. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data.

One thought on “How to create Data analytics projects”

  1. I love the endorsement of iteration. Far too often organizations dive into big data analytics with the thought of immediately hitting the big media worthy wins. While the big wins will come to most, it’s often only the result of learning how to ask the right questions, leverage data and build models. It’s the many small wins that build the confidence to take the additional steps.

    Peter Fretty, IDG blogger working on behalf of SAS
