Hadoop Tutorial – Getting started with Apache Pig


Apache Pig is an abstract language that puts data in the middle. Apache Pig is a “Data-flow” language. In contrast to SQL (and Hive), Pig goes an iterative way and lets data flow from one statement to another. This gives more powerful options when it comes to data. The language used for Apache Pig is called “PigLatin”. A key benefit of Apache Pig is that it abstracts complex tasks in MapReduce such as Joins to very easy functions in Apache Pig. Apache Pig is ways easier for Developers to write complex queries in Hadoop. Pig itself consists of two major components: PigLatin and a runtime environment.

When running Apache Pig, there are two possibilities: the first one is the stand alone mode which is intended to rather small datasets within a virtual machine. On processing Big Data, it is necessary to run Pig in the MapReduce Mode on top of HDFS. Pig applications are usually script files (with the extension .pig) that consist of a series of operations and transformations, that create output data from input data. Pig itself transforms these operations and transformations to MapReduce functions. The set of operations and transformations available by the language can easily be extended via custom code. When compared to the performance of “pure” MapReduce, Pig is a bit slower, but still very close to the native MapReduce performance. Especially for that not experienced in MapReduce, Pig is a great tool (and ways easier to learn than MapReduce)

When writing a Pig application, this application can easily be executed as a script in the Hadoop environment. Especially when using the previously demonstrated Hadoop VM’s, it is easy to get started. Another possibility is to work with Grunt, which allows us to execute Pig commands in the console. The third possibility to run Pig is to embed them in a Java application.

The question is, what differentiates Pig from SQL/Hive. First, Pig is a data-flow language. It is oriented on the data and how it is transformed from one statement to another. It works on a step-by-step iteration and transforms data. Another difference is that SQL needs a schema, but Pig doesn’t. The only dependency is that data needs to be able to work with it in parallel.

The table below will show a sample program. We will look at the possibilities within the next blog posts.

[av_promobox button=’no’ label=’Click me’ link=’manually,http://’ link_target=” color=’theme-color’ custom_bg=’#444444′ custom_font=’#ffffff’ size=’large’ icon_select=’no’ icon=’ue800′ font=’entypo-fontello’]

A = LOAD ‘student‘ USING PigStorage() AS (name:chararray, age:int, gpa:float);

X = FOREACH A GENERATE name,$2;

DUMP X;

(John,4.0F)

(Mary,3.8F)

(Bill,3.9F)

(Joe,3.8F)
[/av_promobox]

Advertisements

Hadoop Tutorial – Getting started with Apache Hadoop


Hadoop is one of the most popular Big Data technologies, or maybe the key Big Data technology. Due to large demand for Hadoop, I’ve decided to write a short Hadoop tutorial series here. In the next weeks, I will write several articles on the Hadoop platform and key technologies.

When we talk about Hadoop, we don’t talk about one specific software or a service. The Hadoop project features several projects, each of them serving different topics in the Big Data ecosystem. When handling Data, Hadoop is very different to traditional RDBMS systems. Key differences are:

  • Hadoop is about large amounts of data. Traditional database systems were only about some gigabyte or terabyte of data, Hadoop can handle much more. Petabytes are not a problem for Hadoop
  • RDBMS work with an interactive access to data, whereas Hadoop is batch-oriented.
  • With traditional database systems, the approach was “read many, write many”. That means, that data gets written often but also modified often. With Hadoop, this is different: the approach now is “write once, read many”. This means that data is written once and then never gets changed. The only purpose is to read the data for analytics.
  • RDBMS systems have schemas. When you design an application, you first need to create the schema of the database. With Hadoop, this is different: the schema is very flexible, it is actually schema-less
  • Last but not least, Hadoop scales linear. If you add 10% more compute capacity, you will get about the same amount of performance. RDBMS are different; at a certain point, scaling them gets really difficult.

Central to Hadoop is the Map/Reduce algorithm. This algorithm was usually introduced by Google to power their search engine. However, the algorithm turned out to be very efficient for distributed systems, so it is nowadays used in many technologies. When you run queries in Hadoop with languages such as Hive or Pig (I will explain them later), these queries are translated to Map/Reduce algorithms by Hadoop. The following figure shows the Map/Reduce algorithm:

Map Reduce function
Map Reduce function

The Map/Reduce function has some steps:

  1. All input data is distributed to the Map functions
  2. The Map functions are running in parallel. The distribution and failover is handled entirely by Hadoop.
  3. The Map functions emit data to a temporary storage
  4. The Reduce function now calculates the temporary stored data

A typical sample is the word-count. With word-count, input data as text is put to a Map function. The Map function adds all words of the same kind to a list in the temporary store. The reduce-function now counts the words and builds a sum.

Next week I will blog about the different Hadoop projects. As already mentioned earlier, Hadoop consists of several other projects.