Archive for

May 2011

Using partial to tidy Clojure tests

I was reading through getwoven’s infer code on GitHub and noticed a nice use of partial to tidy tests.

Frequently there’ll be a function under test that will be provided multiple inputs with perhaps only a single parameter value changing per example. Using partial you can build a function that closes over it’s variable arguments letting you tidy up the call to the single (varying) parameter.

The example I found was in infer’s sparse_classification_test.clj but I’ve written a quick gist to demonstrate the point:

Filed under  //  clojure   testing  
Posted

Recognising SSL Requests in Rack with Custom Headers

For a while now I’ve been working on replacing bits of uSwitch.com’s .NET heritage with some new Clojure and Ruby blood. A problem with one of the new applications came up when we needed it to recognise (and behave differently) when an SSL secured request was made because of non-standard headers (although I’m yet to successfully Google the RFC where HTTP_X_FORWARDED_PROTO is defined :).

Polyglot Deployment

New applications are deployed on different machines but proxied by the old application servers- letting us mount new applications inside the current website. Although the servers run IIS and ASP.NET they’re configured much like nginx’s proxy_pass module (for example). You set the matching URL and the URL you’d like to proxy to. uSwitch.com also use hardware load balancers in front of the application and web servers.

Rack Scheme Detection

Recent releases of Rack include a ssl? method on the Request that return true if the request is secure. This, internally, calls scheme to check whether any of the HTTP headers that specify an HTTPS scheme are set (the snippet from Rack’s current code is below).

Unfortunately, this didn’t recognise our requests as being secure. The hardware load-balancer was setting a custom header that wasn’t anything Rack would expect. The quickest (and perhaps dirty?) solution- a gem, rack-scheme-detect, that allows additional mappings to be provided- the rest of Rack will continue to work (and all of the other middleware and apps above it).

The following snippet shows how it can be used inside a classic Sinatra application to configure the additional mapping.

Of course we’ll be making sure we also set one of the standard headers. But, for anyone who needs a quick solution when that’s not immediately possible, hopefully rack-scheme-detect might save some time.

Filed under  //  rack   ruby   ssl  
Posted

Multiple connections with Hive

We’ve been using Hadoop and Hive at Forward for just over a year now. We have lots of batch jobs written in Ruby (using our Mandy library) and a couple in Clojure (using my fork of cascading-clojure). This post covers a little of how things hang together following a discussion on the Hive mailing list around opening multiple connections to the Hive service.

Hive has some internal thread safety problems (see HIVE-80) that meant we had to do something a little boutique with HAProxy given our usage. This article covers a little about how Hive fits into some of our batch and ad-hoc processing and how we put it under HAProxy.

Background

Hive is used for both programmatic jobs and interactively by analysts. Interactive queries come via a web-based product, called CardWall, that was built for our search agency. In both cases, behind the scenes, we have another application that handles scheduling jobs, retrying when things fail (and alerting if it’s unlikely to work thereafter).

Batch Scheduling

Internally, this is how things (roughly) look:

Batch processing

The Scheduler is responsible for knowing which jobs need to run and queuing work items. Within the batch processing application an Agent will pick up an item and check whether it’s dependencies are available (we use this to integrate with all kinds of other reporting systems, FTP sites etc.). If things aren’t ready it will place the item back on the queue with a delay (we use Beanstalkd which makes this all very easy).

Enter Hive

You’ll notice from the first screenshot we have some Hive jobs that are queued. When we first started using Hive we started using a single instance of the HiveServer service started as follows: $HIVE_HOME/bin/hive --service hiveserver. Mostly this was using a Ruby Gem called RBHive.

This worked fine when we were using it with a single connection, but, we ran into trouble when writing apps that would connect to Hive as part of our batch processing system.

Originally there was no job affinity, workers would pick up any piece of work and attempt it, and with nearly 20 workers this could heavily load the Hive server. Over time we introduced different queues to handle different jobs, allowing us to allocate a chunk of agents to sensitive resources (like Hive connections).

Multiple Hive Servers

We have a couple of machines that run multiple HiveServer instances. We use Upstart to manage these processes (so have hive1.conf, hive2.conf… in our /etc/init directory). Each config file specifies a port to open the service on (10001 in the example below).

These are then load-balanced with HAProxy, the config file we use for that is below

Filed under  //  beanstalkd   hadoop   haproxy   hive   ruby  
Posted
Fork me on GitHub