Introducing Blueshift: Automated Amazon Redshift ingestion from Amazon S3

I’m very pleased to say I’ve just made public a repository for a tool we’ve built to make it easier to ingest data automatically into Amazon Redshift from Amazon S3: https://github.com/uswitch/blueshift.

Amazon Redshift is a wonderfully powerful product; if you’ve not tried it yet you should definitely take a look. I’ve written before about the value of the analytical flow it enables.

However, as nice as it is to consume data from, ingesting data is a little less fun:

  1. Forget about writing raw INSERT statements: we saw individual inserts take on the order of 5 or 6 seconds per record. Far too slow to support our usual stream, let alone the flood we produce when reprocessing the historical transaction log.
  2. Loading data from S3 is definitely the right thing to do, and if you have a purely immutable, append-only flow it is trivial. However, if you need upsert behaviour you’ll have to write it yourself. Whilst the logic isn’t complicated, we have many different systems and teams pushing data into Redshift, and each of them ended up with a slightly different implementation of the same thing.
  3. We saw lots of SQLException 1023 (serializable isolation violation) errors when reprocessing the historical transaction log across many different machines (our transaction log is itself already on S3, stored as lots of large Baldr files): with n machines in the cluster we’d end up with n concurrent transactions inserting into the same table.
  4. Redshift performs best when a load operates over lots of files from S3: the work is distributed to the cluster nodes and performed in parallel. Our original solution ended up with n transactions, each loading a smaller number of files.

Rationale

Blueshift makes life easier by taking away the client’s need to talk directly to Redshift. Instead, clients write data to S3 and Blueshift automatically manages the ingestion into Redshift.

Blueshift is a standalone service written in Clojure (a dialect of Lisp that targets the JVM) that is expected to be deployed on a server. It is configured to watch an S3 bucket; when it detects new data files and a corresponding Blueshift manifest it will perform an upsert transaction for all of the files, with the additional benefit that a large number of files are loaded in a single import.

Using

You can build and run the application with Leiningen, a Clojure build tool:
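For example, something along these lines; the exact invocation and the configuration flag are assumptions on my part, so check the repository README for the canonical instructions:

```
# run directly with Leiningen, or build a standalone jar first
# (the --config flag name is assumed; see the README)
$ lein run -- --config etc/config.edn
$ lein uberjar
```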

The configuration file requires a minimal number of settings:
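As a sketch, the EDN configuration might look like the following. Apart from :key-pattern and :telemetry :reporters, which are described below, the key names and all of the values here are assumptions; the repository README has the exact format:

```clojure
;; config.edn (illustrative sketch; see the Blueshift README for the real format)
{:s3        {:credentials   {:access-key "AKIA..."          ; placeholder
                             :secret-key "..."}             ; placeholder
             :bucket        "blueshift-data"                ; bucket to watch
             :key-pattern   ".*"                            ; regex filtering watched directories
             :poll-interval {:seconds 30}}                  ; how often to poll S3 (assumed)
 ;; the log reporter var name below is assumed
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}}
```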

Most of it is relatively obvious, but the less obvious settings deserve a mention:

  • :key-pattern is a regex used to filter the directories watched within the S3 bucket; this makes it easier to have a single bucket with data from different environments.
  • :telemetry :reporters configures the Metrics reporters; a log reporter is included but we have an extra project (blueshift-riemann-metrics) for pushing data to Riemann.

Once the service is running you just need to write your delimited data files to the S3 bucket being watched and create the necessary Blueshift manifest. Your S3 layout might look something like this:
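For example (the bucket and data file names below are placeholders):

```
s3://blueshift-data/directory-a/foo/manifest.edn
s3://blueshift-data/directory-a/foo/data-001.tsv
s3://blueshift-data/directory-a/foo/data-002.tsv
s3://blueshift-data/directory-b/manifest.edn
```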

When Blueshift polls S3 it will pick up that 2 directories exist and create a watcher for each directory (directory-a/foo and directory-b in our example).

When the directory-a/foo watcher runs it will notice that there are two data files to be imported alongside a Blueshift manifest, manifest.edn, which tells Blueshift where and how to import the data.
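A manifest might look something like this sketch. Only :table, :pk-columns, :options and :data-pattern are discussed in this post; the remaining keys and all of the values are illustrative assumptions, so check the repository for the full format:

```clojure
;; manifest.edn (illustrative sketch; values are placeholders)
{:table        "testing"                            ; target Redshift table
 :pk-columns   ["id"]                               ; columns used for the merge upsert
 :columns      ["id" "value" "created_at"]          ; assumed key: column order of the data files
 :jdbc-url     "jdbc:postgresql://<cluster>.redshift.amazonaws.com:5439/db?user=...&password=..."
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1"] ; passed through to Redshift's COPY
 :data-pattern ".*\\.tsv$"}                         ; regex identifying importable files
```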

Of note in the example above:

  • :pk-columns is used to implement the merge upsert: rows in the target table (testing in the example above) that also appear in the imported data are deleted before the load, because Redshift doesn’t enforce primary key constraints so the old rows have to be removed explicitly (a sketch of this delete-then-load pattern follows this list).
  • :options can contain any of the options that Redshift’s COPY command supports.
  • :data-pattern helps Blueshift identify which files are suitable for being imported.
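To make the merge-upsert behaviour concrete, here is a minimal sketch of the delete-then-load pattern written against clojure.java.jdbc. This is not Blueshift’s actual implementation: the function, the staging-table approach and all names are illustrative assumptions about one common way to realise the behaviour described above.

```clojure
(ns example.upsert
  (:require [clojure.java.jdbc :as jdbc]
            [clojure.string :as str]))

(defn merge-upsert!
  "Delete-then-load upsert sketch: stage the S3 files, delete matching rows
   from the target by primary key, then append the staged rows.
   Illustrative only; not Blueshift's implementation."
  [db table pk-columns s3-url copy-options]
  (jdbc/with-db-transaction [tx db]
    ;; 1. Stage the incoming files into a temp table shaped like the target.
    (jdbc/execute! tx [(str "CREATE TEMP TABLE staging (LIKE " table ")")])
    (jdbc/execute! tx [(str "COPY staging FROM '" s3-url "' " copy-options)])
    ;; 2. Redshift doesn't enforce PK constraints, so delete matching rows first.
    (jdbc/execute! tx [(str "DELETE FROM " table " USING staging WHERE "
                            (str/join " AND "
                                      (map #(format "%s.%s = staging.%s" table % %)
                                           pk-columns)))])
    ;; 3. Append the freshly staged rows into the target table.
    (jdbc/execute! tx [(str "INSERT INTO " table " SELECT * FROM staging")])))
```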

When the data has been imported successfully the data files are deleted from S3, leaving the Blueshift manifest in place ready for more data files.

Summary

We’re delighted with Amazon’s Redshift service and have released Blueshift to make it easier for people to optimally ingest data from their applications via S3 without talking to Redshift directly.

I’d love to hear from people if they find Blueshift useful and I’d be especially delighted for people to contribute (it’s currently less than 400 lines of Clojure).

Responses
Just curious, what is the difference from an AWS Data Pipeline setup, other than Blueshift being self-hosted? We’re using Data Pipeline and are quite happy with it; RedshiftCopyActivity allows fully automatic data import from S3, with support for upserting data.
Good question. We did look at the Data Pipeline docs before we started; I think we erred on the side of building it ourselves to keep things simpler: clients interact with just S3, there’s no need to deal with multiple AWS APIs, and the Data Pipeline documentation didn’t make it look like a simple thing to set up. Glad to hear you’re having success with Data Pipeline; it definitely sounds like it would be useful in other situations.
Can this and Bifrost work seamlessly? i.e. will Blueshift read the Baldr format?
They can’t be used seamlessly: Blueshift essentially just manages the insert operation for input data (as per Redshift’s COPY command), so it’s restricted to delimited data that Redshift can interpret. They’re really meant to be used independently for different purposes (although we of course use them both for our production data flows). Hope that helps? What were/are you trying to do? Paul
Very helpful, but it is in Clojure. Is there a Java/Python version? My use case is that I need to load data from S3 into Redshift tables incrementally every 5 minutes.
It is in Clojure but you shouldn't need to modify anything for your situation. Would it be helpful to have a pre-built package you could install directly?
Does this work on Windows? How would I run it on Windows?
Hi Paul. I was thinking about implementing something similar, but based on the new AWS Lambda. Have you by any chance looked into it, or do you have any thoughts on it? Lambda seems to me the perfect solution because writes to S3 would trigger the Lambdas and I wouldn’t need to have jobs running on my own server (and at an extremely cheap price, of course). Thanks. Great blog.
Ricardo, very interesting idea; I’d be interested to see how it goes. I’ve not looked much into the Lambda service yet, but it sounds like a cool project to experiment with. We started Blueshift to enforce a kind of organisation-wide serialization of batches, so we’d only be performing one load per table at any time; I’m not sure how easy it would be to translate those semantics to AWS Lambda. But that may not be useful/applicable for everyone. I’d definitely encourage you to give it a try; I’d love to see how it goes!
Hi, I am using Bifrost to upload my Kafka messages to S3. When I use Redshift’s default COPY command on the gz files that Bifrost produces, it complains about the data format (Missing newline: Unexpected character 0x0 found at location 0). I suspect that’s because of the way the Baldr file format is designed (the first and last 8 bytes hold the record length). Using Blueshift, can we upload the output files from Bifrost to Redshift? Apologies if my question sounds naive; I am new to Kafka and the AWS world and am trying to get my feet firm. Thanks, Pravesh
Sorry, scratch my previous comment. This has already been discussed above. Thanks.