Apache Oozie is the workflow scheduler for Hadoop jobs. It takes care of executing a workflow step by step in Hadoop. Like the other Hadoop projects, Oozie is built to be highly scalable, fault-tolerant and extensible.
An Oozie workflow is started either when data becomes available or at a specific time. For MapReduce jobs that are scheduled via Oozie, Oozie acts as the root from which they are launched. Other projects such as Pig and Hive (which we will discuss later on) can take advantage of Oozie in the same way.
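Oozie workflows are described in an XML dialect called hPDL (Hadoop Process Definition Language). Oozie knows two different types of nodes:
- Control-Flow-Nodes do exactly what the name says: they control the flow, for example where a workflow starts, where it ends and which path it takes in between.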
- Action-Nodes take care of the actual execution of a job.
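
To make the two node types more concrete, here is a minimal sketch that parses a small hPDL definition with Python's standard library and sorts its nodes into control-flow nodes and action nodes. The workflow itself is only an illustration: the names `example-wf`, `mr-node` and `fail`, as well as the `${jobTracker}` and `${nameNode}` parameters, are assumptions and not taken from a real deployment.

```python
import xml.etree.ElementTree as ET

# A minimal hPDL workflow. All names and parameters are hypothetical.
WORKFLOW_XML = """
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MapReduce job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

# Element names that hPDL uses for control-flow nodes.
CONTROL_FLOW_TAGS = {"start", "end", "kill", "decision", "fork", "join"}


def classify_nodes(xml_text):
    """Return (control_flow_tags, action_names) for a workflow definition."""
    root = ET.fromstring(xml_text)
    control, actions = [], []
    for child in root:
        # ElementTree reports tags as "{namespace}name"; keep only the name.
        tag = child.tag.split("}", 1)[-1]
        if tag in CONTROL_FLOW_TAGS:
            control.append(tag)
        elif tag == "action":
            actions.append(child.get("name"))
    return control, actions


control_nodes, action_nodes = classify_nodes(WORKFLOW_XML)
print("control-flow nodes:", control_nodes)
print("action nodes:", action_nodes)
```

In this sketch the `start`, `kill` and `end` elements only steer the workflow, while the single `action` element wraps the MapReduce job that does the actual work. Its `ok` and `error` transitions tell Oozie where to continue once the job has finished.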
The following illustration shows the iteration process in an Oozie workflow. In the first step, Oozie starts a task (a MapReduce job) on a remote system. Once the task has completed, the remote system sends the result back to Oozie via a callback function.
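
The start-and-callback cycle can also be sketched in a few lines of Python. This is a simplified model and not Oozie's actual implementation: `WorkflowEngine` and `RemoteSystem` are hypothetical stand-ins, the remote system is simulated with a thread, and the callback is an ordinary function call rather than a notification sent over the network.

```python
import threading


class WorkflowEngine:
    """Hypothetical stand-in for Oozie: starts a task and waits for the callback."""

    def __init__(self):
        self.results = {}
        self._done = threading.Event()

    def on_task_complete(self, task_id, status):
        # Callback invoked by the remote system once the task has finished.
        self.results[task_id] = status
        self._done.set()

    def run_action(self, task_id, remote):
        self._done.clear()
        # Step 1: start the task on the remote system and hand over the callback.
        remote.submit(task_id, callback=self.on_task_complete)
        # Step 2: wait until the remote system reports back (or give up after 5 s).
        self._done.wait(timeout=5)
        return self.results.get(task_id)


class RemoteSystem:
    """Hypothetical stand-in for the system that executes the job."""

    def submit(self, task_id, callback):
        def work():
            # A real system would run the MapReduce job here.
            callback(task_id, "SUCCEEDED")

        threading.Thread(target=work).start()


engine = WorkflowEngine()
status = engine.run_action("mr-node", RemoteSystem())
print(status)
```

The engine only moves on to the next node after it has received the result for the current one, which is the step-wise iteration described above.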