Managing Deployment SSH Keys

At Forward the number of virtual machines we're deploying to is increasing steadily; on EC2 alone we have over 30 at the moment. Managing authentication to those servers was becoming increasingly time-consuming.

Previously we'd used a specific user with a password shared between everyone who needed access to the machines. Keeping that up to date was unreliable: some machines would miss the new password, and everybody had to be told about it.

We wanted something better so we're now using a git repository to sync public keys.

It's easy to manage, easy to add new keys, and easy to track changes. We can manage permissions to the repository, remove keys when necessary, and it's very easy to make sure all machines are constantly up to date.

To do this we have a repository that contains a set of user.pub public key files, copied directly from the user's ~/.ssh/id_dsa.pub file (for example).

Machine images already have GitHub's host key accepted and a clone of the repository. A simple Bash script then runs regularly via cron, pulling any changes and updating the authorized_keys file.
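The gist of that cron job, sketched here in Ruby rather than the few lines of Bash we actually run (the repository path, key location and file layout below are illustrative, not our real setup):

#!/usr/bin/env ruby
# Sketch of the sync step run from cron: pull the key repo, then rebuild
# authorized_keys from every *.pub file it contains. Paths are illustrative.

KEYS_REPO       = "/opt/deploy-keys"                    # local clone of the key repository
AUTHORIZED_KEYS = "/home/deploy/.ssh/authorized_keys"

# Fetch the latest public keys.
system("cd #{KEYS_REPO} && git pull --quiet") or abort("git pull failed")

# Concatenate every user.pub file into a fresh authorized_keys.
keys = Dir.glob(File.join(KEYS_REPO, "*.pub")).sort.map { |f| File.read(f).strip }
File.open(AUTHORIZED_KEYS, "w", 0600) { |f| f.puts(keys) }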

It's been working pretty well so far, much easier than before!

Virtualisation, Levels of Abstraction and Thinking Infrastructure

It’s 2010 and although Hoverboards are a little closer I still travel mostly by bus. In contrast, the way we’re using computing at Forward definitely seems to be mirroring a wider progression towards utility computing. What’s interesting is how this is actually achieved in two subtly different ways.

Firstly, let’s take the classic example: services deployed onto a distributed, external virtual-machine environment managed via an API. We use this for some of our most important systems (managing millions and millions of transactions a day) for a couple of great reasons: Amazon’s EC2 service offers us a level of distribution that would be very expensive (and time-consuming) to build ourselves, and, as George mentioned in his post about our continual deployment process, we make use of 4 geographically distributed compute zones.

This is virtualisation, but at a pretty low level. I could fire up a couple of EC2 nodes and do what I liked with them: deploy a bunch of Sinatra apps on Passenger, spin up a temporary Hadoop cluster to do some heavy lifting, or perhaps do some video encoding.

Using the classic EC2 model I (as a consumer of the service) need to understand how to make use of the services that I then deploy. Of course, Amazon makes it a little easier with pre-bundled AMIs that contain pre-installed packages, but this still requires me to be aware of the details: picking the right CPU architecture, finding the AMI with the version of RabbitMQ I’m after, and so on.

A lot of other ‘cloud’ providers (think Joyent Accelerators, Rackspace’s Cloud Servers, Slicehost etc.) are very similar. Although you can programmatically control instances, pay only for the time you use them and so on, you’re still thinking at a relatively low systems level.

Amazon’s Elastic MapReduce Service is an example of a higher-level virtualised abstraction: I don’t care what’s going on underneath (although I have to pay depending on the capacity I want to give it). I submit my job and wait for the reply.

Heroku is another great example of this kind of higher-level service: deploy your code straight from the command-line, dynamically allocate resources etc. I don’t have to worry about a caching infrastructure- it’s built in. My application just needs to be a good HTTP citizen and things just work. Bliss.

Recently we made an investment in some dedicated hardware to replace the existing virtualised infrastructure that ran our Hadoop cluster. As alluded to in the original MapReduce paper, both the implementation and the programming model encourage a general approach to large-scale data processing. Squint at your problem for long enough and it’ll probably fit into the MapReduce model. It’s not always that pleasant (or productive) to do that, so there are a number of higher-level abstractions atop the map/reduce data flows to choose from: Cascading, Pig, and Hive are some good examples for Hadoop; Google also have their Sawzall paper.

Underneath all of that, however, is still a general platform for distributed computation: each layer builds on the previous.

MapReduce (and distributed storage), therefore, provides a kind of virtualisation, albeit at a higher level of abstraction than your average virtual machine. We’re consolidating workloads onto the same infrastructure.

We’re slowly moving more and more of our batch processing onto this infrastructure and (consequently) simplifying the way we deal with substantial growth. Batch processing large data is becoming part of our core infrastructure and, most importantly, is then able to be re-used by other parts of the business.

It feels like there are two different kinds of virtualisation at play here: Amazon EC2 (which offers raw compute power) and platforms like Hadoop which can provide a higher-level utility to a number of consumers. Naturally the former often provides the infrastructure to provide the latter (Elastic MapReduce being a good example).

Perhaps more significant is the progression towards even higher-levels of abstraction.

Google’s Jeff Dean gave a talk late last year (sorry, can’t find the link) about the next generation infrastructure that Google was building: the infrastructure was becoming intelligent.

Rather than worrying about how to deploy an application to get the best from it, and by building it upon some core higher-level services, the system could adapt to meet constraints. Need requests to the EU to be served within 1ms? The system could ensure data is replicated to a rack in a specific region. Need a batch to be finished by 9:00am? The system could ensure enough compute resources are allocated.

Amazon offers an Auto Scaling feature alongside its Elastic Load Balancing service: set conditions that describe when instances should be added or removed and it will respond automatically. That’s great, but I’d rather think in terms of application requirements. It’s a subtle shift in emphasis, much like the move from an imperative to a declarative style.

I have no doubt that virtualisation has been profoundly significant. But, what really excites me is the move towards higher-level services that let me deploy into a set of infrastructure that can adapt to meet my requirements. It sounds as crazy as Hoverboards, but, it doesn’t feel that distant a reality.

Processing XML in Hadoop

This is something that I’ve been battling with for a while and tonight, whilst not able to sleep, I finally found a way to crack it! The answer: use an XmlInputFormat from a related project. Now for the story.

We process lots of XML documents every day and some of them are pretty large: over 8 gigabytes uncompressed. Each. Yikes!

We’d made significant performance gains by switching from the old REXML SAX parser to libxml, but we’d suffered from random segfaults on our production servers (seemingly caused by garbage collection of bad objects). Besides, it was still taking nearly an hour for the largest reports.

Hadoop did seem to offer XML processing: the general advice was to use Hadoop’s StreamXmlRecordReader, which can be accessed through the StreamInputFormat.

This seemed to have weird behaviour with our reports (which often don’t have line-endings): lines would be duplicated and processing would jump to 100%+ complete. All a little fishy.

Hadoop Input Formats and Record Readers

The InputFormat in Hadoop does a couple of things. Most significantly, it provides the Splits that form the chunks that are sent to discrete Mappers.

Splits form the rough boundary of the data to be processed by an individual Mapper. The FileInputFormat (and its subclasses) generates splits based on overall file size. Of course, it’s unlikely that all individual Records (a Key and Value passed to each Map invocation) lie neatly within these splits: records will often cross split boundaries. The RecordReader, therefore, must handle this

… on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task.

In short, a RecordReader may scan from the start of a split for the start of its record, and may then continue reading past the end of its split to find the end. The InputSplit only contains details about the offsets within the underlying file: data is still accessed through the streams.

It seemed like the StreamXmlRecordReader was skipping around the underlying InputStream too much, reading records it wasn’t entitled to read. I tried my best to understand the code, but it was written a long while ago and is pretty cryptic to my limited brain.

I started trying to rewrite the code from scratch but it became pretty hairy very quickly. Take a look at the implementation of next() in LineRecordReader to see what I mean.

Mahout to the Rescue

After a little searching around on GitHub I found another XmlInputFormat courtesy of the Lucene sub-project: Mahout.

I’m happy to say it appears to work. I’ve just run a quick test on our 30 node cluster (via my VPN) and it processed the 8 gig file in about 10 minutes. Not bad.

For anyone trying to process XML with Hadoop: try Mahout’s XmlInputFormat.

MapReduce with Hadoop and Ruby

The quantity of data we analyse at Forward every day has grown nearly ten-fold in just over 18 months. Today we run almost all of our automated daily processing on Hadoop. The vast majority of that analysis is automated using Ruby and our open-source gem: Mandy.

In this post I’d like to cover a little background to MapReduce and Hadoop and a light introduction to writing a word count in Ruby with Mandy that can run on Hadoop.

Mandy’s aim is to make MapReduce with Hadoop as easy as possible: providing a simple structure for writing code, and commands that make it easier to run Hadoop jobs.

Since I left ThoughtWorks and joined Forward in 2008 I’ve spent most of my time working on the internal systems we use to work with our clients, track our ads, and analyse data. Data has become truly fundamental to our business.

I think it’s worth putting the volume of our processing in numbers: we track around 9 million clicks a day across our campaigns- averaged out across a day that’s over 100 every second (it actually peaks up to around 1400 a second).

 

We automate the analysis, storage, and processing of around 80GB (and growing) of data every day. In addition, further ad-hoc analysis is performed through the use of a related project: Hive (more on this in a later blog post).

We’d be hard pressed to work with this volume of data in a reasonable fashion, or to change what analysis we run so quickly, if it wasn’t for Hadoop and its associated projects.

Hadoop is an open-source Apache project that provides “open-source software for reliable, scalable, distributed computing”. It’s written in Java and, although relatively easy to get into, still falls foul of Java development-cycle problems: they’re just too long. My colleague Andy Kent started writing something that would let us prototype our jobs in Ruby quickly and then re-implement them in Java once we understood things better. However, this quickly turned into our platform of choice.

To give a taster of where we’re going, here’s the code needed to run a distributed word count:
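Roughly, a Mandy word count looks like the sketch below. The exact DSL calls (Mandy.job, map, reduce and emit) are written from memory and may differ slightly from the real gem; the canonical example lives in the mandy-lab repository on GitHub.

# Approximate sketch of a Mandy word count job (word_count.rb).
# The DSL names here are from memory; see mandy-lab for the real example.
require 'mandy'

Mandy.job "Word Count" do
  map do |key, value|
    value.split(/\s+/).each { |word| emit(word.downcase, 1) }
  end

  reduce do |word, counts|
    emit(word, counts.size)
  end
end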

And how to run it?

mandy-hadoop wordcount.rb hdfs-dir/war-and-peace.txt hdfs-dir/wordcount-output

The above example comes from a lab Andy and I ran a few months ago at our office. All code (and some readmes) are available on GitHub.

What is Hadoop?

The Apache Hadoop project actually contains a number of sub-projects. MapReduce probably represents the core: a framework that provides a simple distributed processing model which, when combined with HDFS, provides a great way to distribute work across a cluster of hardware. Of course, it goes without saying that it is very closely based upon the Google paper.

Parallelising Problems

Google’s MapReduce paper explains how, by adopting a functional style, programs can be automatically parallelised across a large cluster of machines. The programming model is simple to grasp and, although it takes a little getting used to, can be used to express problems that are seemingly difficult to parallelise.

Significantly, working with larger data sets (both storage and processing) can be solved by growing the cluster’s capacity. In the last 6 months we’ve grown our cluster from 3 ex-development machines to 25 nodes and taken our HDFS storage from just over 1TB to nearly 40TB.

An Example

Consider an example: perform a subtraction across two sets of items, that is {1,2,3} - {1,2} = {3}. A naive implementation might compare every item in the first set against those in the second set.

numbers = [1,2,3]
[1,2].each { |n| numbers.delete(n) }

This is likely to be problematic when we get to large sets, as the time taken grows very quickly (m*n operations). Given an initial set of 10,000,000 items, finding the distinct items against a second set of even 1,000 items would result in 10 billion operations.

However, because of the way the data flows through a MapReduce process, the data can be partitioned across a number of machines such that each machine performs a smaller subset of operations with the result then combined to produce a final output.

It’s this flow that makes it possible to ‘re-phrase’ our solution to the problem above and have it operate across a distributed cluster of machines.
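For example, here's one way to re-phrase the subtraction in map/reduce terms, simulated in plain Ruby (group_by stands in for the shuffle/sort a framework like Hadoop performs; this is purely an illustration of the shape of the map and reduce steps, not how we actually implement it):

set_a = [1, 2, 3]
set_b = [1, 2]

# Map: emit each item as a key, tagged with the set it came from.
mapped = set_a.map { |n| [n, :a] } + set_b.map { |n| [n, :b] }

# Shuffle/sort: group all the values for the same key together.
grouped = mapped.group_by { |key, _| key }

# Reduce: keep only the keys that appeared solely in set A.
result = grouped.map do |key, pairs|
  key if pairs.all? { |_, tag| tag == :a }
end.compact

p result # => [3]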

Overall Data Flow

The Google paper and a Wikipedia article provide more detail on how things fit together, but here’s the rough flow.

Map and reduce functions are combined into one or more ‘Jobs’ (almost all of our analysis is performed by jobs that pass their output to further jobs). The general flow can be seen to be made of the following steps:

  1. Input data read
  2. Data is split
  3. Map function called for each record
  4. Shuffle/sort
  5. Sorted data is merged
  6. Reduce function called
  7. Output (for each reducer) written to a split
  8. Your application works on a join of the splits

First, the input data is read and partitioned into splits that are processed by the Map function. Each map function then receives a whole ‘record’. For text files, these records will be a whole line (marked by a carriage return/line feed at the end).

Output from the Map function is written behind-the-scenes and a shuffle/sort performed. This prepares the data for the reduce phase (which needs all values for the same key to be processed together). Once the data has been sorted and distributed to all the reducers it is then merged to produce the final input to the reduce.

Map

The first function that must be implemented is the map function. This takes a series of key/value pairs, and can emit zero or more pairs of key/value outputs. Ruby already provides a method that works similarly: Enumerable#map.

For example:

names = %w(Paul George Andy Mike)
names.map {|name| {name.size => name}} # => [{4=>"Paul"}, {6=>"George"}, {4=>"Andy"}, {4=>"Mike"}]

The map above calculates the length of each of the names, and emits a list of key/value pairs (the length and names). C# has a similar method in ConvertAll that I mentioned in a previous blog post.

Reduce

The reduce function is called once for each unique key across the output from the map phase. It also produces zero or more key value pairs.

For example, let’s say we wanted to find the frequency of all word lengths, our reduce function would look a little like this:

def reduce(key, values)
  {key => values.size}
end

Behind the scenes, the output from each invocation would be folded together to produce a final output.
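To make that folding concrete, here's a tiny plain-Ruby simulation of the whole flow for the name-length example above (the grouping step stands in for Hadoop's shuffle/sort; it's an illustration, not what Hadoop or Mandy actually execute):

def reduce(key, values)
  {key => values.size}
end

names = %w(Paul George Andy Mike)

# Map phase: emit [length, name] pairs.
mapped = names.map { |name| [name.size, name] }

# Shuffle/sort: gather all the values for each key.
grouped = mapped.group_by { |length, _| length }

# Reduce phase: call reduce once per unique key and fold the results together.
output = grouped.map { |length, pairs| reduce(length, pairs.map { |_, name| name }) }

p output # => [{4=>3}, {6=>1}]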

To explore more about how the whole process combines to produce an output let’s now turn to writing some code.

Code: Word Count in Mandy

As I’ve mentioned previously we use Ruby and Mandy to run most of our automated analysis. This is an open-source Gem we’ve published which wraps some of the common Hadoop command-line tools, and provides some infrastructure to make it easier to write MapReduce jobs.

Installing Mandy

The code is hosted on GitHub, but you can just as easily install it from Gem Cutter using the following:

$ sudo gem install gemcutter
$ gem tumble
$ sudo gem install mandy

… or

$ sudo gem install mandy --source http://gemcutter.org

If you run the mandy command you may now see:

You need to set the HADOOP_HOME environment variable to point to your hadoop install    :(
Try setting 'export HADOOP_HOME=/my/hadoop/path' in your ~/.profile maybe?

You’ll need to make sure you have a 0.20.x install of Hadoop, with HADOOP_HOME set and $HADOOP_HOME/bin in your path. Once that’s done, run the command again and you should see:

You are running Mandy!
========================

Using Hadoop 0.20.1 located at /Users/pingles/Tools/hadoop-0.20.1

Available Mandy Commands
------------------------
mandy-map       Run a map task reading on STDIN and writing to STDOUT
mandy-local     Run a Map/Reduce task locally without requiring hadoop
mandy-rm        remove a file or directory from HDFS
mandy-hadoop    Run a Map/Reduce task on hadoop using the provided cluster config
mandy-install   Installs the Mandy Rubygem on several hosts via ssh.
mandy-reduce    Run a reduce task reading on STDIN and writing to STDOUT
mandy-put       upload a file into HDFS

If you do, you’re all set.

Writing a MapReduce Job

As you may have seen earlier, you can start a Mandy job off with as little as the word count at the top of this post.

Of course, you may be interested in writing some tests for your functions too. Well, you can do that quite easily with the Mandy::TestRunner. Our first test for the Word Count mapper might be as follows:

Our implementation for this might look as follows:

Of course, our real mapper needs to do more than just tokenise the string. It also needs to make sure we downcase and remove any other characters that we want to ignore (numbers, grammatical marks etc.). The extra specs for these are in the mandy-lab GitHub repository but are relatively straightforward to understand.
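The kind of cleansing involved looks roughly like this (an illustrative helper, not the exact code from mandy-lab):

# Illustrative tokenisation: downcase, strip anything that isn't a letter,
# then split on whitespace. The real mapper in mandy-lab differs in detail.
def tokenise(line)
  line.downcase.gsub(/[^a-z\s]/, ' ').split
end

tokenise("Alice was beginning to get very tired...")
# => ["alice", "was", "beginning", "to", "get", "very", "tired"]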

Remember that each map works on a single line of text. We emit a value of 1 for each word we find which will then be shuffled/sorted and partitioned by the Hadoop machinery so that a subset of data is sent to each reducer. Each reducer will, however, receive all the values for that key. All we need to do in our reducer is sum all the values (all the 1’s) and we’ll have a count. First, our spec:

And our implementation:

In fact, Mandy has a number of built-in reducers that can be used (so you don’t need to re-implement this). To re-use the summing reducer, replace the reduce statement above with reduce(Mandy::Reducers::SumReducer).

Our mandy-lab repository includes some sample text files you can run the examples on. Assuming you’re in the root of the repo you can then run the following to perform the wordcount across Alice in Wonderland locally:

$ mandy-local word_count.rb input/alice.txt output
Running Word Count...
/Users/pingles/.../output/1-word-count

If you open the /Users/pingles/.../output/1-word-count file you should see a list of words and their frequency across the document set.

Running the Job on a Cluster

The mandy-local command just uses some shell commands to imitate the MapReduce process; if you try running it over a larger dataset you’ll come unstuck. So let’s see how we run the same example on a real Hadoop cluster. Note that for this to work you’ll need either a Hadoop cluster or Hadoop running locally in pseudo-distributed mode.

If you have a cluster up and running, you’ll need to write a small XML configuration file to provide the HDFS and JobTracker connection info. By default the mandy-hadoop command will look for a file called cluster.xml in your working directory. It should look a little like this:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://datanode:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:9001</value>
  </property>
  <property>
    <name>hadoop.job.ugi</name>
    <value>user,supergroup</value>
  </property>
</configuration>

Once that’s saved, you need to copy your text file to HDFS so that it can be distributed across the cluster ready for processing. As an aside: HDFS shards data into 64MB blocks across the nodes. This provides redundancy to help mitigate machine failures. It also means the job tracker (the node which orchestrates the processing) can move the computation close to the data and avoid copying large amounts of data unnecessarily.

To copy a file up to HDFS, you can run the following command

$ mandy-put my-local-file.txt word-count-example/remote-file.txt

You’re now ready to run your processing on the cluster. To re-run the same word count code on a real Hadoop cluster, you change the mandy-local command to mandy-hadoop as follows:

$ mandy-hadoop word_count.rb word-count-example/remote-file.txt word-count-example/output

Once that’s run you should see some output from Hadoop telling you the job has started, and where you can track its progress (via the HTTP web console):

Packaging code for distribution...
Loading Mandy scripts...

Submitting Job: [1] Word Count...
Job ID: job_200912291552_2336
Kill Command: mandy-kill job_200912291552_2336 -c /Users/pingles/.../cluster.xml
Tracking URL: http://jobtracker:50030/jobdetails.jsp?jobid=job_200912291552_2336

word-count-example/output/1-word-count

Cleaning up...
Completed Successfully!

If you now take a look inside ./word-count-example/output/1-word-count you should see a few part-xxx files. These contain the output from each reducer. Here’s the output from a reducer that ran during my job (running across War and Peace):

abandoned   54
abandoning  26
abandonment 14
abandons    1
abate   2
abbes   1
abdomens    2
abduction   3
able    107

Wrap-up

That’s about as much as I can cover in an introductory post for Mandy and Hadoop. I’ll try and follow this up with a few more to show how you can pass parameters from the command-line to Mandy jobs, how to serialise data between jobs, and how to tackle jobs in a map and reduce stylee.

As always, please feel free to email me or comment if you have any questions, or (even better) anything you’d like me to cover in a future post.

Ruby Influenced C#

Before joining my current project I spent about 4 months working with Ruby every day, the first time I’d done so for a few years. It was a glorious time: uncluttered syntax, closures, internal iterators, and with open classes, the ability to extend the ‘core’ at will.

Today I’m working with C# and .NET, and I’ve noticed that those 4 months with Ruby have changed the way I’ve been writing code. Most noticeably, I’m using anonymous delegates a lot more. But that’s not all.

I’ve found myself aching to use List<T>’s ForEach method; I’m now wired for Ruby’s internal iterators, so instead of

List<Person> people = FindAllPeople();
foreach (Person person in people) {
   ...
}

I can instead do

List<Person> people = FindAllPeople();
people.ForEach(delegate(Person person) {
  Console.WriteLine(person.Name);
});

But frequently I’m put off by the surrounding guff needed to express the same thing, and I’ve almost always gone back to the more traditional external-iterator approach. It’s simply too high a price to pay.

One of the largest smells I’ve noticed recently (to my mind) appears to be driven by not having open classes and internal iterators. If they were there, I’m sure people would use them. The result: all across the codebase, whenever you need to convert from one type to another, you’ll see

List<Person> people = FindAllPeople();
List<String> firstNames = new List<String>();

foreach (Person person in people) {
  firstNames.Add(person.Name);
}

This smells to me. But it really, really smells after having used Ruby, where I would previously have written something as succinct as this:

find_all_people.collect {|person| person.name}

(I’m sure other languages could do equally good things- but I’m familiar with Ruby, before the Pythonists pounce :p)

Well, turns out that you can get nearly there with C# 2.0 and .NET 2.0 with the almost certainly underused ConvertAll method (also part of List<T>).

List<Person> people = FindAllPeople();
List<String> names = people.ConvertAll(delegate(Person person) {
  return person.Name;
});

There’s still a fair bit of accidental complexity remaining: lots of delegate and type declarations.

C# 3.0 introduced lambda expressions, and we can use those to boil our soup down to an even nicer, more intentional broth. We can get rid of the delegate bumpf and let the compiler infer the types (we are still statically typed after all):

List<Person> people = FindAllPeople();
List<String> names = people.ConvertAll(person => person.Name);

Next step, we can also infer the types for our local variables:

var people = FindAllPeople();
var names = people.ConvertAll(person => person.Name);

Pretty nice. Most of the code is focused on the task at hand, and on expressing the necessary complexity (what it means to convert people to names). Guess learning a new language each year has its benefits.

I’ve got another bit of Ruby influenced C# refactoring to cover (a somewhat declarative way of removing switch statements), hopefully I’ll get that posted tomorrow!

Prioritising Work

I was involved in the inception work (and the resulting delivery) for a project late summer last year. We estimated the total work to be nearly 500 units- too much to complete in the time we had. So, working with the client we cut it down to a reasonable scope (this client rocked at that!) of around a third.

What was really cool, however, was that a few months in after that initial scope had been delivered, we looked ahead for what we’d do next. According to our original inception, there was still more than double left to go, but, instead what we planned to do next was radically different. What we’d believed to be important 3 months ago, no longer was. The result was we delivered about 30% more, and about 50% overall was not part of that initial (500 unit) scope.

The real key was that our client had a very definite focus on work that was important (that is, work that delivers the most value), avoided getting drawn into unimportant-but-urgent work, and didn’t leave valuable work until it became urgent. If you’re always working on urgent things you’re working at breaking point and missing the opportunity to pick up higher-value items that are less urgent.

The distinction is usually drawn as Covey’s urgent/important grid: a two-by-two of urgency against importance (more can be read about Covey’s grid on Wikipedia).

The two ‘important’ quadrants are the key: both represent working on things that are valuable, i.e. things that are important to do. In contrast, the two ‘not important’ quadrants are relatively unimportant and should demand less attention.

The rub (of course) is that urgent things tend to be shouty and demand attention. I would also say it’s often easier to measure how urgent something is rather than how important it is; as a result, urgency is an easier (and thus more likely) benchmark for prioritising tasks - despite ignoring whether the work is worth doing at all.

Some of the best people I’ve worked with (including our sponsor at our client last year) have a remarkable ability to cut through the context and spot what’s really important now; as opposed to just reacting to what’s demanding our attention now.

Poor Man's C# Singleton Checker

Paul Hammant wrote a nice article about how to refactor the “nest-of-singletons design” towards using dependency injection using Google’s Guice IoC Container.

Whilst waiting for one of my many builds to finish today, I figured I’d satisfy a curiosity: roughly how many singletons are defined within this codebase? I fired up Cygwin and used the following:

find . -type f -name "*.cs" | xargs cat | grep "public static [A-Za-z]\{1,100\} Instance" | wc -l

Result:

180

Yikes!

For Java, you can always use the Google Singleton Checker which also has some nice stuff about why they’re controversial.

(Update) Paul Hammant pointed out that his article wasn’t about refactoring out singletons, rather, breaking away from using the service locator to dependency injection. Apologies for muddling it up a little :)

Declarative Programming with Ruby

During the most recent ThoughtWorks away-day (a chance for the office to get together, catch-up, drink etc.), George and I presented on a number of Ruby and Rails lessons we learned from our (now previous) project. One of the most interesting sections (to us anyway) was on declarative programming, specifically, refactoring to a declarative design.

I guess, much like DSLs, it’s easier to feel when you’re achieving something declarative than to define exactly what makes it so. But I’ll try my clumsy best to define something.

Almost every language I’ve turned my hand to (save for Erlang) is an imperative language, where programs are written as sequences of operations with changes of state. You determine the what and the how of the system: what to do and how to do it.

Declarative programming is an alternative paradigm, whereby code expresses the what; how the system executes it is someone else’s responsibility.

So, for the purposes of this discussion, let’s consider that application code can be split into two groups:

  1. Logic: the rules, the guts of things - the what.

  2. Control: statements about execution flow.

Interestingly, the principles listed in Kent Beck’s most recent book (Implementation Patterns) include “Declarative Expression”: that you should be able to read what your code is doing without having to understand the wider execution context.

Declarative languages are all around us, with most developers I’m guessing using them almost daily.

Think of SQL: when you write a statement such as SELECT [Name], [Age] FROM [Person] WHERE [Age] > 15 you’re making a statement about what you’d like, not how to get it - that’s for the database engine to figure out. And a good thing too! Have you ever taken a look at the execution plan for a modestly complex query?

Closer to home, think of .NET attributes and Java annotations, where you can decorate constructs with additional behaviours. They look and feel like core extensions to the language, but are programmable and can be used to adapt the runtime behaviour of the system.

Before working for ThoughtWorks, I worked on a system where we used attributes to allow us to add validations to properties, allowing us to re-use code and extend easily.

[LengthMustBeAtLeast(6)]
public string FirstName
{
  get { ... }
  set { ... }
}

Everything was nicely decoupled, read well, and reduced the amount of clutter in our code. More importantly, our validation code was not spread throughout every setter of every property. We could isolate responsibilities making code easier to digest and understand, and test!

Onto Ruby. A substantial part of our project involved reading lots of CSV files from different sources to update our deal information. The answer was a kind of anticorruption layer (borrowing heavily from domain-driven design) for each different feed.

Quickly, we ended up with a few concrete classes and a base class that co-ordinated the effort. Dependencies were shared both ways: imagine a number of template methods called in sequence to accomplish their work. Over time it grew complex, and with Ruby it’s a little tougher (than with languages like Java or C#) to navigate and browse around the code without strong IDE support.

It was starting to get a little too complex and we felt we needed to change things, so we did (bolstered somewhat by having Jay with us).

Onto the code.

So imagine we have a class representing a Feed of information (read from a CSV file), and we want to be able to ask that Feed to provide us with a number of Deals. Internally, it will iterate over the items in the Feed, creating a Deal for each one (if possible).

Our first solution looked a little like this, firstly in the ‘abstract’ base class:

def create_deal
  Deal.create(:network_name => network_name)
end

and in the feed’s concrete implementation:

FIELD_INDICES = {:name => 1, :network_name => 2}

def network_name
  read_cell(FIELD_INDICES[:network_name])
end

Our main feed class asks the implementation class for the network_name - a template method. Not bad, but now build that up to tens of attributes and it gets rather longer. Since we’ve defined the column indices in a constant, we also have to navigate up and down a lot to determine where we’re reading from. Add in a few look-up tables for other bits of the mapping and it can get complex pretty quickly.

Our code was made up not only of the stuff that determines what it means to translate a CSV representation of our Deals into an object-model one, but also all of the code needed to find out which CSV column we’re in, how we map that column and so on. Most of the mapping methods were essentially just reading values from cells, with no translation needed. Most of the code we had was infrastructural, and the logic (the what of our application) was hidden amongst the noise.

This complexity, combined with the split of flow between abstract and concrete classes, made it difficult to follow and understand. Our goal was to try and reduce each concrete implementation to a single page on our screens.

These little ‘mapping’ translation methods were our first target - reduce the amount of code for each of these to one line that described the mapping, rather than how we get it all.

Firstly, we introduced a convention - every attribute of a Deal could be retrieved by calling a deal_attribute_my_attribute style method. So, we renamed all our methods, ran the tests, and then started to make steps towards having each deal_attribute_blah method defined dynamically.

We went from:

def network
  ...
end

to

def deal_attribute_network
  ...
end

to … nothing.

Well, not quite. Instead, what we wanted was to have a method constructed by adding a class method to a module that we could mix-in. Then, we could just define the mapping and Ruby would wire up the rest. We settled on the following syntax

deal_attribute :name => 'NAME'

Neat. A little class_eval and instance_eval magic later, we were able to push our infrastructural code out of our concrete feed class and into the co-ordinator.
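For the curious, the trick looks roughly like this. It's a simplified sketch of the technique rather than our actual implementation: a mixed-in module adds a deal_attribute class method that uses define_method to generate the deal_attribute_* readers, and read_cell here is a toy stand-in for the real cell-reading code.

module DealAttributes
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declare a mapping from a deal attribute to one or more columns,
    # optionally with a block to translate the raw values.
    def deal_attribute(mapping, &block)
      mapping.each do |attribute, columns|
        define_method("deal_attribute_#{attribute}") do
          values = Array(columns).map { |column| read_cell(column) }
          block ? instance_exec(*values, &block) : values.first
        end
      end
    end
  end
end

# A toy feed to show the generated readers in action; read_cell just looks
# the column up in a hash rather than a real CSV row.
class ExampleFeed
  include DealAttributes

  deal_attribute :name => 'NAME'
  deal_attribute(:summary => ['NAME', 'DESC']) { |name, desc| "#{name}: #{desc}" }

  def initialize(row)
    @row = row
  end

  def read_cell(column)
    @row[column]
  end
end

feed = ExampleFeed.new('NAME' => 'Some Deal', 'DESC' => 'A great offer')
feed.deal_attribute_name    # => "Some Deal"
feed.deal_attribute_summary # => "Some Deal: A great offer"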

Our code is now more declarative. We’re stating what we need to do our work, rather than worrying about how we get at it. Not only that, for the common case (where we may be just moving values from one place to another) there’s no need to do anything more than describe that relationship. Declarative programming makes it much easier to express important relationships. It’s now much easier to see the relationship between the name attribute of a Deal and the NAME column for this CSV feed.

Notice also how we’re pushing our dependencies up to our caller- the coupling is now one-way. We state what we need from our caller (the main deal feed class) - our caller is able to then pass the information on. We’re just answering questions, rather than answering questions and asking questions (of our caller).

The syntax also lends itself to explaining a dependency, that an attribute of our deal is read from ‘NAME’ (for example).

That’s great. The next step was to tidy up some of the slightly more complex examples where we do some additional translation - for example, where we take the name of something and need to pull back an object from the database instead. Let’s say we keep track of the Phone that a Deal is for.

So, from

def deal_attribute_phone
  Phone.find(deal_attribute_brand, deal_attribute_model)
end

to

deal_attribute(:phone => ['MAKE', 'MODELNAME']) do |brand, model|
  Phone.find(brand, model)
end

We’ve extended the syntax to reveal that this translation needs values from both the ‘MAKE’ and ‘MODELNAME’ columns. From our perspective, we’re pushing responsibility up and keeping our code focused on what we need to map this attribute.

This is a little more complex to achieve since we’re also passing arguments across (and our deal_attribute translator methods also sometimes need to access instance variables) so we need to use instance_exec instead of the standard instance_eval.
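The difference, in miniature: instance_eval swaps self for the duration of the block but gives you no way to pass arguments into it, while instance_exec swaps self and forwards arguments. A tiny illustrative example (nothing to do with the feed code itself):

class Feed
  def initialize
    @prefix = "deal"
  end
end

block = lambda { |name| "#{@prefix}-#{name}" }

# instance_exec changes self (so @prefix resolves against the Feed instance)
# *and* forwards "phone" as the block argument.
Feed.new.instance_exec("phone", &block) # => "deal-phone"

# Feed.new.instance_eval(&block) would also run the block with self switched,
# but there's no way to pass "phone" in as an argument.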

The end result was feed classes that looked as follows

class MySpecialFeed
  deal_attribute :name => 'NAME'
  deal_attribute(:description => 'DESC') { |desc| cleanse_description(desc) }
  deal_attribute(:phone => ['MAKE', 'MODELNAME']) do |brand, model|
    Phone.find(brand, model)
  end
  ...
end

In total, it took us probably just over a day to refactor the code for all of our classes, most of which we managed to get down to some 40 or 50 lines in total. We didn’t refactor all the code, so there’s still potential for exploiting the approach further, but it was definitely a very exciting thing to see happen.

Copying Classes

From across the desk, George asks “can you copy classes in Ruby?”. We talk about it quickly and reason that since everything’s an Object (even classes), you probably can. Since the constant isn’t changed or duplicated (you’re essentially assigning a new one) then it ought to be possible.

Turns out it is!

class First
  def initialize
    @value = 99
  end

  def say_value
    @value
  end
end

First.new.say_value # => 99

Second = First.clone
Second.class_eval do
  define_method :say_value do
    @value + 100
  end
end

Second.new.say_value # => 199

Neat.

I’m not sure quite why you would want to clone a class to take advantage of re-use - rather than extract to a module (and share the implementation that way) or, if there’s a strong relationship that doesn’t violate the LSP etc. then look for some kind of inheritance-based design.

But, I guess you could work some kind of cool ultra-dynamic super-meta system from it. Perhaps someone with way more of a Ruby-thinking brain than me could offer some thoughts?

Watch out for the Monkey Patch

The project I’m currently working on uses both the Asset Packager and Distributed Assets plugins to ensure we have only a few external assets, and that we can load assets across more than one host - all so that the pages for our site load nice and quick.

Unfortunately, wiring in the Asset Packager plugin caused the Distributed Assets plugin to break, and I spent an hour or two tracking it down yesterday. The cause? Asset Packager redefines the compute_public_path method.

# rewrite compute_public_path to allow us to not include the query string timestamp
# used by ActionView::Helpers::AssetTagHelper
def compute_public_path(source, dir, ext=nil, add_asset_id=true)
  source = source.dup
  source << ".#{ext}" if File.extname(source).blank? && ext
  unless source =~ %r{^[-a-z]+://}
    source = "/#{dir}/#{source}" unless source[0] == ?/
    asset_id = rails_asset_id(source)
    source << '?' + asset_id if defined?(RAILS_ROOT) and add_asset_id and not asset_id.blank?
    source = "#{ActionController::Base.asset_host}#{@controller.request.relative_url_root}#{source}"
  end
  source
end

Distributed Assets works by chaining compute_public_path, decorating the calculated path and adding the asset host prefix onto the URL. But Asset Packager works by defining the method directly in ActionView::Base. So, when DistributedAssets::AssetTagHelper is included into ActionView::Helpers::AssetTagHelper, it ends up chaining a (now) hidden method.
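To see why, here's a stripped-down illustration of the Ruby behaviour involved (toy module and class names, not the plugins' actual code): a method that chains an existing one is simply lost when another definition is later added directly to the class.

module AssetHelpers
  def compute_public_path(source)
    "/assets/#{source}"
  end
end

module DistributedAssets
  def self.included(base)
    base.class_eval do
      alias_method :compute_public_path_without_host, :compute_public_path

      # Chained version: decorate the original result with an asset host.
      def compute_public_path(source)
        "http://assets.example.com" + compute_public_path_without_host(source)
      end
    end
  end
end

class View
  include AssetHelpers
  include DistributedAssets
end

View.new.compute_public_path("app.js") # => "http://assets.example.com/assets/app.js"

# Another plugin now redefines the method directly on the class...
class View
  def compute_public_path(source)
    "/packaged/#{source}"
  end
end

# ...and the chained, decorating version is overwritten - the asset-host
# decoration is lost.
View.new.compute_public_path("app.js") # => "/packaged/app.js"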

But the only place that uses the new compute_public_path code inside the Asset Packager helper (which just avoids using the query string timestamp) is within Asset Packager itself.

So, I tweaked the implementation of AssetPackageHelper to

def compute_public_path_for_packager(source, dir, ext, add_asset_id=true)
  path = compute_public_path(source, dir, ext)
  return path if add_asset_id
  path.gsub(/\?\d+$/, '')
end

def javascript_path(source)
  compute_public_path_for_packager(source, 'javascripts', 'js', false)
end

Beware the monkey patch.