Hadoop Tutorial – Scheduling MapReduce Workflows with Oozie


Apache Oozie is the workflow scheduler for Hadoop Jobs. Oozie basically takes care of the step-wise workflow iteration in Hadoop. Oozie is like all other Hadoop projects built for high scalability, fault tolerance and extensible.

An Oozie Workflow is started by data availability or after a specific time. Oozie is the root for all MapReduce jobs as they get scheduled via Oozie. This also means that all other projects such as Pig and Hive (which we will discuss later on) also take advantage of Oozie.

Oozie workflows are described in an XML-Dialect, which is called hPDL. Oozie knows two different types of nodes:

  • Control-Flow-Nodes that take do exactly what the name says: controlling the flow.
  • Action-Nodes take care of the actual execution of a job.

The following illustration shows the iteration process in an Oozie Workflow. The first step for Oozie is to start a task (MapReduce Job) on a remote system. Once the task has completed, the remote system sends the result back to the remote system via a callback function.

Apache Oozie
Apache Oozie
Advertisements

Published by

Mario Meir-Huber

I work as Big Data Architect for Microsoft. With this role, I support my customers in applying Big Data technologies - mainly Hadoop/Spark - for their use-cases. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s