Hadoop Tutorial – Getting started with Apache Pig


Apache Pig is a high-level “data-flow” language that puts the data at the center. In contrast to SQL (and Hive), Pig works step by step and lets data flow from one statement to the next, which gives more control over how the data is transformed. The language used for Apache Pig is called “Pig Latin”. A key benefit of Apache Pig is that it reduces tasks that are complex to write in MapReduce, such as joins, to very simple statements. This makes it much easier for developers to write complex queries in Hadoop. Pig itself consists of two major components: Pig Latin and a runtime environment.
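To illustrate how compact a join is in Pig Latin, here is a minimal sketch. The file names 'students.tsv' and 'enrollments.tsv', their field layouts, and the relation names are assumptions made for this example and are not taken from a real dataset.

```pig
-- Hypothetical input files; names and schemas are assumed for illustration.
students    = LOAD 'students.tsv'    USING PigStorage('\t') AS (id:int, name:chararray);
enrollments = LOAD 'enrollments.tsv' USING PigStorage('\t') AS (student_id:int, course:chararray);

-- A single JOIN statement; Pig turns it into the required MapReduce work.
joined = JOIN students BY id, enrollments BY student_id;

-- Keep only the columns of interest; :: disambiguates fields after a join.
result = FOREACH joined GENERATE students::name AS name, enrollments::course AS course;

DUMP result;
```

Written by hand in MapReduce, the same join would need custom mapper and reducer code to tag and match records from both inputs.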

There are two ways to run Apache Pig. The first is local (standalone) mode, which is intended for rather small datasets, for example within a virtual machine. To process Big Data, it is necessary to run Pig in MapReduce mode on top of HDFS. Pig applications are usually script files (with the extension .pig) that consist of a series of operations and transformations that create output data from input data. Pig itself translates these operations and transformations into MapReduce jobs. The set of operations and transformations offered by the language can easily be extended with custom code (user-defined functions). Compared to hand-written MapReduce, Pig is a bit slower, but still very close to native MapReduce performance. Especially for those not experienced in MapReduce, Pig is a great tool (and much easier to learn than MapReduce).
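As a rough sketch of what such a script file can look like, the following example chains a few operations from input to output. The script name, the input file 'student', its schema, the GPA threshold, and the output directory are all assumptions chosen for illustration.

```pig
-- top_students.pig (hypothetical script name)
-- Local mode:     pig -x local top_students.pig
-- MapReduce mode: pig -x mapreduce top_students.pig

-- 1. Load the input (file name and schema are assumed).
students = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- 2. Keep only students with a GPA of at least 3.5.
good = FILTER students BY gpa >= 3.5F;

-- 3. Group by age and compute one aggregate row per group.
by_age = GROUP good BY age;
stats  = FOREACH by_age GENERATE group AS age, COUNT(good) AS num_students, AVG(good.gpa) AS avg_gpa;

-- 4. Write the result as comma-separated text (output directory name is assumed).
STORE stats INTO 'student_stats' USING PigStorage(',');
```

The same script runs unchanged in both modes; only the `-x` option on the command line differs.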

A Pig application can easily be executed as a script in the Hadoop environment. Especially when using the previously demonstrated Hadoop VMs, it is easy to get started. Another possibility is to work with Grunt, Pig's interactive shell, which allows us to execute Pig commands in the console. The third possibility is to embed Pig statements in a Java application.
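A short Grunt session might look like the sketch below. It assumes that Pig was started with `pig -x local` and that a file named 'student' with the shown layout exists in the working directory; each statement is typed at the `grunt>` prompt.

```pig
-- Define a relation; nothing is executed yet.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Print the schema of A without running a job.
DESCRIBE A;

-- Trigger execution and print the tuples to the console.
DUMP A;

-- Leave the Grunt shell.
quit;
```

Working interactively like this is convenient for trying out individual statements before putting them into a script file.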

The question is what differentiates Pig from SQL/Hive. First, Pig is a data-flow language. It is oriented toward the data and how it is transformed from one statement to the next: a script works step by step, with each statement transforming the result of an earlier one. Another difference is that SQL needs a schema, but Pig doesn't. The only requirement is that the data can be processed in parallel.
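To make the schema point concrete, here is a minimal sketch that loads a file without declaring any schema and addresses fields by position. The file name 'student' and the assumption that its third column holds a GPA are illustrative only.

```pig
-- No AS clause: the data is loaded without a declared schema.
A = LOAD 'student' USING PigStorage();

-- Fields are addressed by position: $0 is the first column, $2 the third.
-- Without a schema the fields default to bytearray, so cast where the type matters.
X = FOREACH A GENERATE $0 AS name, (float)$2 AS gpa;

DUMP X;
```

Each statement takes the relation produced by an earlier one as its input, which is the step-by-step data flow described above.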

The sample program below loads student records and prints each student's name and GPA. We will look at further possibilities in the next blog posts.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

X = FOREACH A GENERATE name, $2;

DUMP X;

(John,4.0F)

(Mary,3.8F)

(Bill,3.9F)

(Joe,3.8F)


Published by

Mario Meir-Huber

I work as a Big Data Architect for Microsoft. In this role, I support my customers in applying Big Data technologies, mainly Hadoop/Spark, to their use cases. I also teach this topic at various universities and frequently speak at conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German and Austrian universities. In my home country (Austria) I am part of several organisations on Big Data.
