Cloud is not the future


Now you probably think: is Mario crazy? In this post, I will explain why the cloud is not the future.

First, let's have a look at the economics of the cloud. If we look at the share prices of companies providing cloud services, it is easy to see that those shares are skyrocketing (leaving aside recent drops in some of them, which are more about market dynamics than real valuations). Overall company performance tells the same story: the income of companies providing cloud services has increased considerably. Have a look at the major cloud providers such as AWS, Google, Oracle or Microsoft: they now make a substantial part of their revenue with cloud services. So here, obviously, my initial statement seems to be wrong. Why did I choose it then? Still crazy?

Let's look at another explanation: it might be all about technology, right? I was recently playing with AWS API Gateway and AWS Lambda, and it is remarkably easy to write a good API with them. I could program an API for an Android app in a few hours, and deployment was easy. Remember when you first had to deploy your full stack for this and make sure all the libraries were set up correctly? Another example is data analytics. Much of it is currently moving from "classical" Hadoop-backed HDFS to decoupled architectures (object stores as the "data lake" and Spark for compute/analytics). This also clearly favours the cloud, because storage and compute can be scaled individually and utilisation is easier to handle: when you need more compute power, you spin up new instances and disconnect them again when you are done. This simply can't be done on-premises or in a private cloud, since the available capacity there is sized to match some fixed corporate requirements.

But what else? Let's look at how new applications or services are developed. Nowadays, almost every service is developed "cloud first", which means it is not available without the cloud, or only becomes available at a very late stage and with substantial delay. So if you want to stay ahead in innovation, it is necessary to embrace the cloud here. And please don't tell me that you would rather wait because it isn't necessary to be among the first ones to move. Answer: of course it is fine to wait until your business is dead ;).

So there are no real arguments against the cloud. Why, then, did I formulate the title like this? Provocation? Clickbaiting? NO: Cloud is not the future, it is the present!


How to: Start and Stop Cloudera on Azure with the Azure CLI


The Azure CLI is my favorite tool to manage Hadoop clusters on Azure. Why? Because I can now use the tools I am used to from Linux on my Windows PC. On Windows 10, I am using the Ubuntu Bash for that, which gives me all the major tools for managing remote Hadoop clusters.

One thing I do frequently is starting and stopping Hadoop clusters based on Cloudera. If you are coming from PowerShell, this can be rather painful, since you can only start each VM in the cluster sequentially, meaning that a cluster consisting of 10 or more nodes is rather slow to start and might take hours! With the Azure CLI I can easily do this by specifying "--no-wait", and everything runs in parallel. The only disadvantage is that I won't get any notification when the cluster is ready. I work around that with a simple hack: ssh'ing into the cluster (since I have to do this anyway). SSH will succeed once the master nodes are ready, and then I can perform some tasks on the nodes (such as restarting Cloudera Manager, since CM is usually a bit "dizzy" after being sent to sleep and woken up again :))

Let's start with the easiest step: stopping the cluster. Every Azure CLI command starts with "az" (meaning Azure, of course). The command for stopping one or more VMs is "vm stop". The only two things I need to provide are the IDs of the VMs I want to stop and "--no-wait", since I want the script to return right away.

So, the script would look like the following:

az vm stop --ids YOUR_IDS --no-wait

However, this still has one major disadvantage: you would need to hardcode all the IDs. This doesn't matter at all if your cluster never changes, but in my case I add and delete VMs to and from the cluster, so this script doesn't play well for my case. However, the CLI is very flexible (and so is Bash), and I can query all the VMs in a resource group. This gives me the IDs that are currently in the cluster (let's assume I delete dropped VMs from the resource group and add new VMs to it). The query for retrieving all VMs in a resource group is easy:

az vm list --resource-group YOUR_RESOURCE_GROUP --query "[].id" -o tsv

This will give me all IDs in the RG. The real fun starts when doing this in one statement:

az vm stop --ids $(az vm list --resource-group clouderarg --query "[].id" -o tsv) --no-wait

Which is really nice and easy 🙂

It is similar with starting VMs in a Resource Group:

az vm start --ids $(az vm list --resource-group mmhclouderarg --query "[].id" -o tsv) --no-wait
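
Since "--no-wait" returns immediately, the CLI itself won't tell me when the cluster is up. The following is a minimal sketch of the ssh hack described above; the resource group name, the master node's hostname and the admin user are assumptions, so adjust them to your cluster (restarting Cloudera Manager also requires sudo rights on the master node):

# check the power state of all VMs in the resource group
az vm list --resource-group clouderarg --show-details --query "[].{name:name, power:powerState}" -o table

# wait until SSH succeeds on the master node, then restart Cloudera Manager
until ssh -o ConnectTimeout=5 azureuser@cloudera-master.example.com "exit" 2>/dev/null; do
  echo "master node not reachable yet, retrying in 30 seconds..."
  sleep 30
done
ssh azureuser@cloudera-master.example.com "sudo service cloudera-scm-server restart"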

Hadoop Tutorial – Working with the Apache Hue GUI


When working with the main Hadoop services, it is not necessary to work with the console all the time (even though this is the most powerful way of doing so). Most Hadoop distributions also come with a user interface called "Apache Hue", a web-based interface running on top of a distribution. Apache Hue integrates major Hadoop projects such as Hive, Pig and HCatalog into its UI. The nice thing about Apache Hue is that it makes the management of your Hadoop installation pretty easy with a great web-based UI.

The following screenshot shows Apache Hue on the Cloudera distribution.

Apache Hue

Hadoop Tutorial – Hadoop Common


Hadoop Common is one of the easiest things to explain in the Hadoop context, even though it might get complicated when working with it. Hadoop Common is a collection of libraries and tools that are needed when working with Hadoop, and these libraries and tools are used by the various projects in the Hadoop ecosystem. Samples include:

  • A CLI MiniCluster that enables a single-node Hadoop installation for testing purposes (a sample invocation is shown below)
  • Native libraries for Hadoop
  • Authentication and superusers
  • A Hadoop secure mode

You might not use all of the tools and libraries in Hadoop Common, as some of them are only needed when you work on your own Hadoop projects.
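
As a quick example for the CLI MiniCluster mentioned in the list above, a single command from the Hadoop installation directory is enough to bring it up. This is only a sketch: the exact jar path and name depend on your Hadoop version, and RM_PORT/JHS_PORT must be replaced with free ports on your machine:

# start a single-node MiniCluster for testing (jar path/name varies between Hadoop versions)
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  minicluster -rmport RM_PORT -jhsport JHS_PORT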

Hadoop Tutorial – Serialising Data with Apache Avro


Apache Avro is a data serialisation framework in the Hadoop ecosystem. The main tasks of Avro are:

  • Provide complex data structures
  • Provide a compact and fast binary data format
  • Provide a container file to persist data
  • Provide remote procedure calls (RPC)
  • Enable integration with dynamic languages

Avro schemas are defined in JSON and allow several different types:

Primitive types

  • Null, Boolean, Int, Long, Float, Double, Bytes and String

Complex types

  • Record, Enum, Array, Map, Union and Fixed

The sample below demonstrates an Avro schema:

{
  "namespace": "person.avro",
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": ["int", "null"]},
    {"name": "street", "type": ["string", "null"]}
  ]
}

Listing 4: an Avro schema
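
To show how such a schema is actually used, here is a small sketch based on the avro-tools command line utility (the avro-tools version and the file names are assumptions). It serialises JSON records into an Avro container file and reads them back; note that for union types such as ["int", "null"], non-null values have to be wrapped with their branch type in the JSON input:

# one JSON record per line; union values are wrapped with their type ({"int": 35}), null stays plain
echo '{"name": "Mario", "age": {"int": 35}, "street": null}' > person.json

# serialise the JSON record into an Avro container file using the schema above (saved as person.avsc)
java -jar avro-tools-1.11.1.jar fromjson --schema-file person.avsc person.json > person.avro

# read the binary Avro file back as JSON to verify the round trip
java -jar avro-tools-1.11.1.jar tojson person.avro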

Hadoop Tutorial – Importing Large Amounts of Data with Apache Sqoop


Apache Sqoop is in charge of moving large datasets between different storage systems, for example from relational databases to Hadoop and back. Sqoop supports a large number of connectors, such as JDBC, to work with different data sources. Sqoop makes it easy to import existing data into Hadoop.

Sqoop supports the following databases:

  • HSQLDB starting version 1.8
  • MySQL starting version 5.0
  • Oracle starting version 10.2
  • PostgreSQL
  • Microsoft SQL Server

Sqoop provides several possibilities to import and export data from and to Hadoop. The service also provides several mechanisms to validate data.
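
As a short sketch of what an import and an export look like on the command line (the connection string, credentials, table names and HDFS paths are made up for illustration):

# import the "customers" table from MySQL into HDFS with 4 parallel mappers (-P prompts for the password)
sqoop import \
  --connect jdbc:mysql://dbserver:3306/shop \
  --username etl_user -P \
  --table customers \
  --target-dir /data/raw/customers \
  --num-mappers 4

# export aggregated results from HDFS back into a relational table
sqoop export \
  --connect jdbc:mysql://dbserver:3306/shop \
  --username etl_user -P \
  --table customer_aggregates \
  --export-dir /data/curated/customer_aggregates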

Hadoop Tutorial – Analysing Log Data with Apache Flume


Most IT departments produce a large amount of log data. This is especially the case when server systems are monitored, but it is also necessary for device monitoring. Apache Flume comes into play when this log data needs to be collected and analysed.

Flume is all about data collection and aggregation. It is built on a flexible architecture that is based on streaming data flows, and the service allows you to extend the data model. Key elements of Flume are:

  • Event. An event is the data that is transported from one place to another.
  • Flow. A flow consists of several events that are transported between several places.
  • Client. A client is the start of a transport. There are several clients available; a frequently used one is the Log4j appender.
  • Agent. An agent is an independent process that hosts the components of Flume (sources, channels and sinks).
  • Source. A source is an interface implementation that is capable of consuming events delivered to it. An example is the Avro source.
  • Channels. If a source receives an event, this event is passed on to one or more channels. A channel is a store that holds the event until it is consumed, e.g. a memory, file or JDBC channel.
  • Sink. A sink takes an event from the channel and transports it to the next process.
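
To make these components more tangible, here is a minimal agent configuration close to the standard example from the Flume documentation (the agent, source, channel and sink names are arbitrary): a netcat source feeds events through a memory channel into a logger sink.

# write a minimal Flume configuration: netcat source -> memory channel -> logger sink
cat > example.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF

# start the agent; lines sent to localhost:44444 (e.g. with netcat) show up as events in the console log
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console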

The following figure illustrates the typical workflow for Apache Flume with its components.

Apache Flume