Big Data – how to achieve data quality


To be worth storing, data must fulfill certain quality attributes. Heinrich & Stelzer (2011) defined a set of data quality attributes that should be met.

 

Data quality attributes
  • Relevance. Data should be relevant to the use case. If a query looks up all users of a web portal who are interested in “luxury cars”, all of these users should be returned. It should be possible to draw some benefit from the data, e.g. for advanced marketing targeting.
  • Correctness. Data has to be correct. If we again query for all users of a web portal who are interested in luxury cars, the returned data should really represent people interested in luxury cars, and fake entries should be removed.
  • Completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can actually reach them, e.g. by e-mail. If the e-mail field, or any other field we need in order to reach our users, is blank, the data is not complete for our use case.
  • Timeliness. Data should be up to date. A user might change their e-mail address after a while, and our database should reflect such changes whenever and wherever possible. If we target users interested in luxury cars, it is of little use if only 50% of the users’ e-mail addresses are still valid. We might have “big data”, but the data is no longer correct because it has not been updated for a while.
  • Accuracy. Data should be as accurate as possible. Web site users should be able to state “Yes, I am interested in luxury cars” directly instead of only naming their favorite brand (which could be offered in addition). A favorite brand is accurate to a degree, but not accurate enough. Imagine someone selects “BMW” as their favorite brand. BMW can be considered a luxury brand, but it offers a range of different models. If someone selects BMW only because they like the sporty features, the targeting mechanism might hit the wrong people.
  • Consistency. This should not be confused with the consistency requirement of the CAP theorem (see next section). Data might be duplicated, since users might register several times to get various benefits. A user might select “luxury cars” with one account and “budget cars” with another. Duplicate accounts lead to inconsistent data, and they are a frequent problem in large web portals such as Facebook (Kelly, 2012).
  • Availability. Data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP theorem. Here, however, the focus is not on the general availability of the database but on the availability of each individual dataset. The algorithm querying the data should be good enough to retrieve all available data, and there should be easy-to-use tools and languages to retrieve it. Normally, each database provides developers with a query language such as SQL, or with O/R mappers.
  • Understandability. Data should be easy to understand. If we query our database for people interested in luxury cars, we should be able to understand easily what the data is about. Once the data is returned, we should be able to work with it in our favorite tool. The data should describe itself, and we should know how to handle it. If the result contains a “zip” column, we know that it holds the ZIP code of the area each user lives in.
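Some of these attributes can be checked mechanically. The following is a minimal Python sketch, not a definitive implementation, of three such checks for the luxury-car example: completeness (is the e-mail field filled?), timeliness (when was the record last updated?) and consistency (does the same e-mail address appear with conflicting interests?). The `User` record, its field names, the one-year staleness threshold and the idea of matching duplicates by normalized e-mail address are all assumptions made for illustration; a real portal would need its own schema and matching rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class User:
    # Hypothetical record layout; all field names are assumptions.
    user_id: int
    email: Optional[str]
    interest: str
    last_updated: date


def completeness(users, field="email"):
    """Return the share of records (0.0 to 1.0) whose `field` is filled in."""
    if not users:
        return 0.0
    filled = sum(1 for u in users if getattr(u, field))
    return filled / len(users)


def stale_records(users, today, max_age_days=365):
    """Return records not updated within `max_age_days` (timeliness check)."""
    cutoff = today - timedelta(days=max_age_days)
    return [u for u in users if u.last_updated < cutoff]


def conflicting_duplicates(users):
    """Return e-mail addresses that occur with more than one distinct interest.

    Records are grouped by lower-cased, stripped e-mail address, which is
    only one (assumed) way of detecting duplicate accounts.
    """
    interests_by_email = {}
    for u in users:
        if not u.email:
            continue  # nothing to match on; counted by completeness() instead
        key = u.email.strip().lower()
        interests_by_email.setdefault(key, set()).add(u.interest)
    return {
        email: sorted(interests)
        for email, interests in interests_by_email.items()
        if len(interests) > 1
    }


# Made-up sample data for illustration only.
users = [
    User(1, "anna@example.com", "luxury cars", date(2024, 5, 1)),
    User(2, None, "luxury cars", date(2024, 4, 1)),
    User(3, "Anna@Example.com ", "budget cars", date(2021, 1, 15)),
    User(4, "ben@example.com", "luxury cars", date(2024, 3, 10)),
]
today = date(2024, 6, 1)

print(completeness(users))                               # 0.75
print([u.user_id for u in stale_records(users, today)])  # [3]
print(conflicting_duplicates(users))
# {'anna@example.com': ['budget cars', 'luxury cars']}
```

Checks like these only cover the attributes that can be measured from the data itself. Relevance, accuracy and understandability depend on the use case and still need a human judgment.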

 


Published by

Mario Meir-Huber

I work as a Big Data Architect for Microsoft. In this role, I support my customers in applying Big Data technologies, mainly Hadoop/Spark, to their use cases. I also teach this topic at various universities and frequently speak at conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German and Austrian universities. In my home country (Austria) I am part of several organisations working on Big Data.

3 thoughts on “Big Data – how to achieve data quality”

  1. Now the question is: how to handle DQ in Big Data projects? How to measure whether I collected all tweets and blog posts, so that I got all sentiments about my products?

    1. There are still other questions, such as whether a tweet is a positive or a negative one, or whether it is sarcasm ;)
      ML might give an answer, but it requires a lot of work though … 😉

  2. Great post. Data quality is such an important component to making sure any data platform works. In fact it should be at the core of any data culture – an appreciation and understanding of the data the organization collects and leverages. I would say as an industry we are still maturing, so unfortunately it will still take some time before data quality gets the recognition it deserves.

    Peter Fretty, IDG blogger working on behalf of SAS
