Introducing Blueshift: Automated Amazon Redshift ingestion from Amazon S3

I’m very pleased to say I’ve just made public a repository for a tool we’ve built to make it easier to ingest data automatically into Amazon Redshift from Amazon S3: https://github.com/uswitch/blueshift.

Amazon Redshift is a wonderfully powerful product, if you’ve not tried it yet you should definitely take a look; I’ve written before about the value of the analytical flow it enables.

However, as nice as it is to consume data from, ingesting data is a little less fun:

  1. Forget about writing raw INSERT statements: we saw individual inserts take on the order of 5 or 6 seconds per record. Far too slow to support our usual stream let alone the flood given we reprocess the historical transaction log.
  2. Loading data from S3 is definitely the right thing to do. And, if you have a purely immutable append-only flow, this is trivial. However, if you need upsert behaviour you’ll have to write it yourself. Whilst not complicated, we have many different systems and teams pushing data into Redshift: each of these have a slightly different implementation of the same thing.
  3. We’ve seen lots of SQLException 1023 errors when we were reprocessing the historical transaction log across many different machines (our transaction log is itself already on S3 stored as lots of large Baldr files); with n machines in the cluster we’d end up with n transactions inserting into the same table.
  4. Redshift performs best when running loads operating over lots of files from S3: work is distributed to the cluster nodes and performed in parallel. Our original solution ended up with n transactions loading a smaller number of files.

Rationale

Blueshift makes life easier by taking away the client’s need to talk directly to Redshift. Instead, clients write data to S3 and Blueshift automatically manages the ingestion into Redshift.

Blueshift is a standalone service written in Clojure (a dialect of Lisp that targets the JVM) that is expected to be deployed on a server. It is configured to watch an S3 bucket; when it detects new data files and a corresponding Blueshift manifest it will perform an upsert transaction for all the files with the additional benefit of being a single import for a large number of files.

Using

You can build and run the application with Leiningen, a Clojure build tool:

The configuration file requires a minimal number of settings:

Most of it is relatively obvious but for the less obvious bits:

  • :key-pattern is a regex used to filter the directories watched within the S3 bucket; this makes it easier to have a single bucket with data from different environments.
  • :telemetry :reporters configures the Metrics reporters; a log reporter is included but we have an extra project (blueshift-riemann-metrics) for pushing data to Riemann.

Once the service is running you just need to write your delimited data files to the S3 bucket being watched and create the necessary Blueshift manifest. Your S3 layout might look something like this:

When Blueshift polls S3 it will pick up that 2 directories exist and create a watcher for each directory (directory-a/foo and directory-b in our example).

When the directory-a watcher runs it will notice there are 2 data files to be imported and a Blueshift manifest: manifest.edn: this is used to tell Blueshift where and how to import the data.

Of note in the example above:

  • :pk-columns is used to implement the merge upsert: rows in the target table (testing in the above example) that exist in our imported data are removed before the load: Redshift doesn’t enforce PK constraints so you have to delete the data first.
  • :options: can contain any of the options that Redshift’s COPY command supports.
  • :data-pattern helps Blueshift identify which files are suitable for being imported.

When the data has been imported successfully the data files are deleted from S3 leaving the Blueshift manifest ready for more data files.

Summary

We’re delighted with Amazon’s Redshift service and have released Blueshift to make it easier for people to optimally ingest data from their applications via S3 without talking to Redshift directly.

I’d love to hear from people if they find Blueshift useful and I’d be especially delighted for people to contribute (it’s currently less than 400 lines of Clojure).