Virtualisation, Levels of Abstraction and Thinking Infrastructure

It’s 2010 and although Hoverboards are a little closer I still travel mostly by bus. In contrast, the way we’re using computing at Forward definitely seems to be mirroring a wider progression towards utility computing. What’s interesting is how this is actually achieved in two subtly different ways.

Firstly, let’s take the classic example: services deployed onto a distributed, external virtual machine environment managed via an API. We use this for some of our most important systems (managing millions and millions of transactions a day), and for good reason: Amazon’s EC2 service offers us a level of distribution that would be very expensive (and time-consuming) to build ourselves. As George mentioned in his post about our continual deployment process, we make use of 4 geographically distributed compute zones.

This is virtualisation, but at a pretty low level. I could fire up a couple of EC2 nodes and do what I liked with them: deploy a bunch of Sinatra apps on Passenger, spin up a temporary Hadoop cluster to do some heavy lifting, or perhaps do some video encoding.

Using the classic EC2 model I (as a consumer of the service) need to understand how to run the services that I then deploy. Of course, Amazon make it a little easier with pre-bundled AMIs that contain pre-installed packages, but I still need to know what I’m doing: pick the right CPU architecture, find the AMI with the version of RabbitMQ I’m after etc.

A lot of other ‘cloud’ providers (think Joyent Accelerators, Rackspace’s Cloud Servers, Slicehost etc.) are very similar. Although you can programmatically control instances, only pay for the time you use them etc., you’re still thinking at a relatively low systems level.

Amazon’s Elastic MapReduce Service is an example of a higher-level virtualised abstraction: I don’t care what’s going on underneath (although I have to pay depending on the capacity I want to give it). I submit my job and wait for the reply.

Heroku is another great example of this kind of higher-level service: deploy your code straight from the command line, dynamically allocate resources etc. I don’t have to worry about a caching infrastructure: it’s built in. My application just needs to be a good HTTP citizen and things just work. Bliss.

Recently we made an investment in some dedicated hardware to replace the existing virtualised infrastructure that ran our Hadoop cluster. As the original MapReduce paper suggests, both the implementation and the development model encourage a general approach to large-scale data processing. Squint at your problem for long enough and it’ll probably fit into the MapReduce model. It’s not always that pleasant (or productive) to do that, so there are a number of higher-level abstractions atop the map/reduce data flows to choose from: Cascading, Pig, and Hive are some good examples for Hadoop, and Google have their Sawzall paper.
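To make the "squinting" a little more concrete, here’s a minimal, single-process sketch of the map/reduce data flow in Python, using the usual word count example. The function names (map_phase, shuffle, reduce_phase, word_count) are my own for illustration and are not Hadoop’s API; a real cluster would run the map and reduce steps in parallel across machines and do the grouping for you.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all the values seen for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["the bus is late", "the hoverboard is later"]))
```

The point is less the word counting and more the shape: once a problem is expressed as independent map and reduce functions, the layer underneath is free to decide where and how they run.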

Underneath all of that, however, is still a general platform for distributed computation: each layer builds on the previous.

MapReduce (and distributed storage), therefore, provides a kind of virtualisation, albeit at a higher level of abstraction than your average virtual machine. We’re consolidating workloads onto the same infrastructure.

We’re slowly moving more and more of our batch processing onto this infrastructure and (consequently) simplifying the way we deal with substantial growth. Batch processing of large data sets is becoming part of our core infrastructure and, most importantly, can then be re-used by other parts of the business.

It feels like there are two different kinds of virtualisation at play here: Amazon EC2 (which offers raw compute power) and platforms like Hadoop, which can provide a higher-level utility to a number of consumers. Naturally the former often supplies the infrastructure for the latter (Elastic MapReduce being a good example).

Perhaps more significant is the progression towards even higher levels of abstraction.

Google’s Jeff Dean gave a talk late last year (sorry, can’t find the link) about the next generation infrastructure that Google was building: the infrastructure was becoming intelligent.

Rather than me worrying about how to deploy an application to get the best from it, an application built upon some core higher-level services could leave that to the system, which would adapt to meet constraints. Need requests to the EU to be served within 1ms? The system could ensure data is replicated to a rack in a specific region. Need a batch to be finished by 9:00am? The system could ensure enough compute resources are allocated.

Amazon’s Auto Scaling feature (which works alongside the Elastic Load Balancing service) is a step in this direction: set conditions that describe when instances should be added or removed and it will automatically respond. That’s great, but I’d rather think in terms of application requirements. It’s a subtle shift in emphasis, much like the move from an imperative to a declarative style.
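Here’s a rough sketch in Python of that shift in emphasis. Everything in it is hypothetical: ScalingRule, BatchRequirement and instances_needed are names I’ve made up for illustration, not anything from Amazon’s or Google’s APIs, and the capacity sum assumes work spreads evenly across instances. The first class is the imperative style (I say when to add or remove instances); the second is the declarative style (I state the deadline and the system works out the capacity).

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingRule:
    """Imperative style: I spell out when instances are added or removed."""
    scale_up_above: float    # average CPU fraction that triggers a scale-up
    scale_down_below: float  # average CPU fraction that triggers a scale-down
    step: int = 1            # instances to add or remove each time

    def apply(self, instances, avg_cpu):
        """Return the new instance count for the observed average CPU."""
        if avg_cpu > self.scale_up_above:
            return instances + self.step
        if avg_cpu < self.scale_down_below:
            return max(1, instances - self.step)
        return instances


@dataclass
class BatchRequirement:
    """Declarative style: I state what I need, not how to achieve it."""
    work_units: int        # total units of work in the batch
    deadline_hours: float  # hours left until the batch must be finished


def instances_needed(requirement, units_per_instance_hour):
    """Work out capacity from the requirement (assumes perfectly even scaling)."""
    per_instance = units_per_instance_hour * requirement.deadline_hours
    return max(1, math.ceil(requirement.work_units / per_instance))


if __name__ == "__main__":
    rule = ScalingRule(scale_up_above=0.8, scale_down_below=0.2)
    print(rule.apply(instances=4, avg_cpu=0.9))
    print(instances_needed(BatchRequirement(work_units=1000, deadline_hours=5.0), 50))
```

With the rule I’m still encoding my guess about how load relates to capacity; with the requirement, that guess becomes the platform’s problem, which is where I’d like it to live.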

I have no doubt that virtualisation has been profoundly significant. But what really excites me is the move towards higher-level services that let me deploy onto infrastructure that can adapt to meet my requirements. It sounds as crazy as Hoverboards, but it doesn’t feel that distant a reality.