Hadoop Tutorial – Real-Time Data with Apache S4

S4 is another near-real-time project in the Hadoop ecosystem. S4 is built with a decentralized architecture in mind, focusing on a scalable, event-oriented design. It runs as a long-running process that analyzes streaming data.

S4 is written in Java and designed for flexibility. It achieves this through dependency injection, which makes the platform easy to extend and change. S4 relies heavily on loose coupling and dynamic association via the publish/subscribe pattern. This makes it easy to integrate sub-systems into larger systems, and services on individual sub-systems can be updated independently.
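To make the publish/subscribe idea more concrete, here is a minimal sketch in plain Java. It is not the S4 API: the `EventBus` class, its `subscribe` and `publish` methods, and the `clicks` topic are hypothetical names chosen only for illustration. The point is that the publisher knows the topic name but nothing about who consumes the events.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical toy event bus for illustration only; this is NOT the S4 API.
final class EventBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Register a handler for a named topic.
    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, k -> new ArrayList<>()).add(handler);
    }

    // Deliver an event to every handler subscribed to the topic.
    // Returns the number of handlers that were notified.
    int publish(String topic, String event) {
        List<Consumer<String>> handlers =
                subscribers.getOrDefault(topic, Collections.<Consumer<String>>emptyList());
        for (Consumer<String> handler : handlers) {
            handler.accept(event);
        }
        return handlers.size();
    }
}

final class PubSubDemo {
    public static void main(String[] args) {
        EventBus bus = new EventBus();
        List<String> seen = new ArrayList<>();

        // Two independent consumers; neither knows about the other or the publisher.
        bus.subscribe("clicks", e -> seen.add("counter:" + e));
        bus.subscribe("clicks", e -> seen.add("logger:" + e));

        bus.publish("clicks", "page=/home");
        System.out.println(seen); // [counter:page=/home, logger:page=/home]
    }
}
```

Because handlers are looked up by topic at publish time, a new consumer can be added or an old one replaced without touching the publishing code, which is the kind of independent update described above.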

S4 is built to be highly fault-tolerant. Mechanisms built into S4 allow fail-over and recovery.


Hadoop Tutorial – Accessing streaming data with Apache Storm

Apache Storm is in charge of analyzing streaming data in Hadoop. Storm is extremely powerful for this kind of workload and is capable of working in near real-time. Storm was originally developed at BackType and was open-sourced by Twitter after it acquired the company. According to the project's own benchmark, Storm can process about one million tuples per second per node. The nice thing about Storm is that it scales linearly as nodes are added.

The Storm architecture is similar to that of other Hadoop projects, although Storm comes with its own challenges. First, there is Nimbus. Nimbus is the controller for Storm, comparable to the JobTracker in Hadoop. Apache Storm also uses ZooKeeper for coordination. A Supervisor runs on each node and manages the worker processes that take care of the tuples once they come in. The following figure shows this.

Storm Topology

Apache Storm is built around four major concepts: streams, spouts, bolts, and topologies.

Storm Tuples

A stream is an unbounded sequence of tuples. A spout is a source of streams. Bolts process input streams and create new output streams. A topology is a network of spouts and bolts.
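The relationship between these four concepts can be sketched in plain Java without any Storm dependency. Everything below is a hypothetical toy model, not the Storm API: `ToySpout`, `ToyBolt`, `ToyTopology`, and the word-count bolts are illustrative names, tuples are simplified to plain strings, and the topology is a simple linear chain run in a single thread, whereas a real topology is a distributed graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration only; these are NOT Storm interfaces.
interface ToySpout {
    // Returns the next tuple, or null when this finite demo source is exhausted.
    // (A real stream is unbounded and would never finish.)
    String nextTuple();
}

interface ToyBolt {
    // Consumes one input tuple and returns zero or more output tuples.
    List<String> execute(String tuple);
}

// Spout: the source of the stream, here a fixed list of sentences.
final class SentenceSpout implements ToySpout {
    private final Iterator<String> it;
    SentenceSpout(List<String> sentences) { this.it = sentences.iterator(); }
    public String nextTuple() { return it.hasNext() ? it.next() : null; }
}

// Bolt 1: turns a stream of sentences into a stream of words.
final class SplitBolt implements ToyBolt {
    public List<String> execute(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) return Collections.emptyList();
        return Arrays.asList(trimmed.split("\\s+"));
    }
}

// Bolt 2: keeps a running count per word and emits "word=count".
final class CountBolt implements ToyBolt {
    final Map<String, Integer> counts = new TreeMap<>();
    public List<String> execute(String word) {
        int n = counts.merge(word, 1, Integer::sum);
        return Collections.singletonList(word + "=" + n);
    }
}

// Topology: wires one spout to a chain of bolts.
final class ToyTopology {
    private final ToySpout spout;
    private final List<ToyBolt> bolts;

    ToyTopology(ToySpout spout, List<ToyBolt> bolts) {
        this.spout = spout;
        this.bolts = bolts;
    }

    // Pulls tuples from the spout and pushes each one through the bolts in order.
    // Returns the tuples emitted by the last bolt.
    List<String> run() {
        List<String> out = new ArrayList<>();
        for (String t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            List<String> current = Collections.singletonList(t);
            for (ToyBolt bolt : bolts) {
                List<String> next = new ArrayList<>();
                for (String in : current) next.addAll(bolt.execute(in));
                current = next;
            }
            out.addAll(current);
        }
        return out;
    }
}

final class ToyTopologyDemo {
    public static void main(String[] args) {
        CountBolt counter = new CountBolt();
        ToyTopology topology = new ToyTopology(
                new SentenceSpout(Arrays.asList("storm processes streams", "storm scales")),
                Arrays.<ToyBolt>asList(new SplitBolt(), counter));
        topology.run();
        System.out.println(counter.counts); // {processes=1, scales=1, storm=2, streams=1}
    }
}
```

Each bolt only sees the tuples handed to it and knows nothing about what comes before or after it, so the same bolts can be rewired into a different topology without modification.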