Big Data analysis is an iterative process. A famous fictional example of data analysis on a huge scale comes from the novel “The Hitchhiker’s Guide to the Galaxy”. In the novel, a supercomputer is asked for the “Answer to the Ultimate Question of Life, the Universe, and Everything”. As this is quite a difficult problem to solve, iteration is necessary.
Bizer, Boncz, Brodie, and Erling (2011) describe iteration steps for creating Big Data analysis applications. The paper names five steps: Define, Search, Transform, Entity Resolution, and Answer the Query.
Define deals with the problem that needs to be solved. This is when the marketing manager says: “We need to find a location in county ‘xy’ where customers are over 18 and below 30 years old and where we have no store yet.” In our initial example from “The Hitchhiker’s Guide to the Galaxy”, this would be the question about the answer to everything.
Next, we identify candidate elements in the Big Data space. Bizer, Boncz, Brodie, and Erling (2011) call this step “Search”. In the marketing example, this means that we have to scan the data of all customers who are between 18 and 30 years old. This data must then be combined with store locations. In the “Hitchhiker’s Guide” example, this means that we have to scan all data, as we are trying to find the answer to everything.
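The Search step for the marketing example can be sketched in a few lines of Python. This is only a minimal illustration: the `Customer` record, its fields, and the `search_candidates` function are hypothetical names, and the age bounds follow the “over 18 and below 30” wording of the Define step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # Hypothetical record layout, assumed for illustration only.
    customer_id: int
    age: int
    county: str


def search_candidates(customers, county, min_age=18, max_age=30):
    """Return customers in `county` who are over `min_age` and below `max_age`."""
    return [
        c for c in customers
        if c.county == county and min_age < c.age < max_age
    ]


customers = [
    Customer(1, 17, "xy"),
    Customer(2, 25, "xy"),
    Customer(3, 29, "ab"),
    Customer(4, 30, "xy"),
]

candidates = search_candidates(customers, "xy")
print([c.customer_id for c in candidates])
```

In a real Big Data setting, the scan would run against a distributed store rather than an in-memory list, but the filter condition stays the same.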
Transform means that the identified data has to be “extracted, transformed and loaded” into appropriate formats. This is part of the preparation phase, after which the data is almost ready for calculation. Data is extracted from different sources and transformed into a uniform format that can be used for the analysis. In the marketing example, we need to use sources from the government and combine them with our own customer data. Furthermore, we need map data. All this data is then stored in our database for analysis. The “Hitchhiker’s Guide” problem is more complicated: since we need to analyze ALL data available in the universe, we simply can’t copy it to a new system. The analysis has to be done on the systems where the data is stored.
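As a rough sketch of what “extracted, transformed and loaded” means in the marketing example, the following Python code maps two differently shaped sources onto one common record format. The field names of both sources and all function names are assumptions made for illustration, and a plain list stands in for the analysis database.

```python
def normalize_government(row):
    """Map a (hypothetical) government record to the common format."""
    return {
        "name": row["NAME"].strip().title(),
        "age": int(row["AGE"]),
        "county": row["COUNTY"].strip().lower(),
        "source": "government",
    }


def normalize_crm(row):
    """Map a (hypothetical) record from our own customer data to the common format."""
    return {
        "name": f"{row['first_name']} {row['last_name']}".strip().title(),
        "age": int(row["age"]),
        "county": row["county"].strip().lower(),
        "source": "crm",
    }


def extract_transform_load(government_rows, crm_rows):
    """Transform both sources and load them into one list (our stand-in database)."""
    database = []
    database.extend(normalize_government(r) for r in government_rows)
    database.extend(normalize_crm(r) for r in crm_rows)
    return database


government_rows = [{"NAME": "ALICE SMITH ", "AGE": "25", "COUNTY": " XY"}]
crm_rows = [{"first_name": "bob", "last_name": "jones", "age": 31, "county": "XY"}]

database = extract_transform_load(government_rows, crm_rows)
for record in database:
    print(record)
```

After this step, every record has the same keys and the same value types, regardless of where it came from.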
After the data elements are prepared, we need to resolve entities. In this phase, we ensure that data entities are unique, relevant and comprehensive. In the marketing example, this means that all entities with an age of 18 to 30 are resolved. In the Hitchhiker’s problem, we can’t resolve entities: once again, we need to find the answer to everything and can’t afford to exclude data.
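A very simplified sketch of entity resolution for the marketing example could look as follows. It treats two records as the same entity when their normalized name and county match, and it keeps only records in the relevant age range. Real entity resolution usually needs fuzzy matching and richer keys; the matching rule, the record layout, and the function names here are assumptions for illustration.

```python
def entity_key(record):
    """Build a simple matching key; assumes name plus county identifies a person."""
    return (record["name"].strip().lower(), record["county"].strip().lower())


def resolve_entities(records, min_age=18, max_age=30):
    """Keep one record per entity, and only entities in the relevant age range."""
    resolved = {}
    for record in records:
        if not (min_age < record["age"] < max_age):
            continue  # not relevant for the question defined earlier
        # Keep the first record seen for each entity and drop later duplicates.
        resolved.setdefault(entity_key(record), record)
    return list(resolved.values())


records = [
    {"name": "Alice Smith", "age": 25, "county": "xy"},
    {"name": "alice smith ", "age": 25, "county": "XY"},  # duplicate of the first
    {"name": "Bob Jones", "age": 31, "county": "xy"},     # outside the age range
    {"name": "Carol White", "age": 22, "county": "xy"},
]

entities = resolve_entities(records)
print([e["name"] for e in entities])
```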
In the last step, the data is finally analyzed. Bizer, Boncz, Brodie, and Erling (2011) describe this as “Answer the Query”. Big Data analysis usually needs many nodes that compute the results from the available datasets. In our marketing example, we would look at the resolved datasets and compare them with our store locations. The result would be a list of locations where no store exists yet and the age condition is fulfilled. In our Hitchhiker’s example, we would analyze all data and look for the ultimate answer.
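The idea of several nodes each computing a partial result can be sketched as follows. Each partition stands for the resolved customer records held by one node; every partition is counted separately and the partial counts are then merged. The `location` field, the partition layout, and the function names are hypothetical, and the loop runs sequentially here, whereas a real system would distribute it.

```python
from collections import Counter


def count_by_location(partition):
    """Partial result of one node: resolved customers per location."""
    return Counter(record["location"] for record in partition)


def answer_query(partitions, store_locations):
    """Merge the partial counts and drop locations that already have a store."""
    total = Counter()
    for partition in partitions:
        # In a real cluster, each partition would be processed on its own node.
        total.update(count_by_location(partition))
    without_store = [loc for loc in total if loc not in store_locations]
    # Sort by customer count (descending), then by name for a stable order.
    return sorted(without_store, key=lambda loc: (-total[loc], loc))


partitions = [
    [{"location": "north"}, {"location": "south"}],
    [{"location": "north"}, {"location": "east"}],
]
store_locations = {"south"}

result = answer_query(partitions, store_locations)
print(result)
```

Ranking the remaining locations by customer count is an addition for readability; the query itself only requires the list of locations without a store.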
The following figure, “The 5 Steps for Big Data Analysis”, displays the necessary steps for Big Data analysis as described above.