Hadoop Tutorial – Serialising Data with Apache Avro


Apache Avro is a service in Hadoop that enables data serialization. The main tasks of Avro are:

  • Provide complex data structures
  • Provide a compact and fast binary data format
  • Provide a container to persist data
  • Provide RPC’s to the data
  • Enable the integration with dynamic languages

Avro is built with a JSON Schema, that allows several different types:

Elementary types

  • Null, Boolean, Int, Long, Float, Double, Byte and String

Complex types

  • Record, Enum, Array, Map, Union and Fixed

The sample below demonstrates an Avro schema

{“namespace”: “person.avro”,

“type”: “record”,

“name”: “Person”,

“fields”: [

{“name”: “name”, “type”: “string”},

{“name”: “age”,  “type”: [“int”, “null”]},

{“name”: “street”, “type”: [“string”, “null”]}

]

}

Table 4: an avro schema

Advertisements

Published by

Mario Meir-Huber

I work as Big Data Architect for Microsoft. With this role, I support my customers in applying Big Data technologies - mainly Hadoop/Spark - for their use-cases. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s