Introducing Bifrost: Archive Kafka data to Amazon S3

We're happy to announce the public release of a tool we've been using in production for a while now: Bifrost.

We use Bifrost to incrementally archive all our Kafka data into Amazon S3; these transaction logs can then be ingested into our streaming data pipeline (in practice we only need the archived files occasionally, when we radically change our computation).

Rationale

There are a few other projects that scratch the same itch: notably Secor from Pinterest and Kafka's old hadoop-consumer. Although Secor doesn't rely on running Hadoop jobs, it still uses Hadoop's SequenceFile format; sequence files allow compression and distributed splitting, as well as letting you access the data in a record-oriented way. We'd been using code similar to Kafka's hadoop-consumer for a long time, but it ran slowly and didn't do a great job of watching for new topics and partitions.

We wanted something that didn't introduce the cascade of Hadoop dependencies and would be able to run a little more hands-off.

Bifrost

Bifrost is the tool we wanted. It has a handful of configuration options that need to be set (e.g. S3 credentials, Kafka ZooKeeper consumer properties) and will continually monitor Kafka for new topics/partitions and create consumers to archive the data to S3. 

Here's an example configuration file:
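The snippet below is an EDN sketch of the sort of thing that goes in it; treat the key names as illustrative rather than definitive and check the project README for the exact options.

```clojure
;; Illustrative only: the key names here are assumptions, not the definitive
;; Bifrost schema. See the project README for the real configuration options.
{:consumer-properties {"zookeeper.connect" "zookeeper.example.com:2181"
                       "group.id"          "bifrost-archiver"
                       "auto.offset.reset" "smallest"}
 :bucket      "our-kafka-archive"
 :credentials {:access-key "AKIAEXAMPLE"
               :secret-key "..."}}
```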

Bifrost is written in Clojure and can be built using Leiningen. If you want to try it locally you can just `lein run`; for production you can build an uberjar and run:
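Something along these lines; the exact jar name depends on the project version, and treat the `--config` flag as illustrative:

```
lein uberjar
java -jar target/bifrost-<version>-standalone.jar --config ./config.edn
```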


Data is stored in "directories" in S3: `s3://<bucket>/<kafka-consumer-group-id>/<topic>/partition=<partition-id>/0000000.baldr.gz` (the filename is the starting offset for that file).

We use our Baldr file format: the first 8 bytes indicate the length of the record, the following n bytes represent the record itself, then another 8-byte record length, and so on. It provides almost all of what we need from SequenceFiles but without the Hadoop dependencies. We have a Clojure implementation, but it should be trivial to write in any language. We also compress the whole file output stream with Gzip to speed up uploads and reduce the amount of data we store on S3.
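To make that framing concrete, here's a minimal Clojure sketch of reading and writing records in this layout; it mirrors the description above rather than the actual Baldr API:

```clojure
(import '(java.io DataInputStream DataOutputStream EOFException))

;; Write one record: an 8-byte length prefix followed by the record bytes.
(defn write-record [^DataOutputStream out ^bytes record]
  (.writeLong out (alength record))
  (.write out record 0 (alength record)))

;; Read one record, returning nil once the end of the stream is reached.
(defn read-record [^DataInputStream in]
  (try
    (let [len (.readLong in)
          buf (byte-array len)]
      (.readFully in buf)
      buf)
    (catch EOFException _ nil)))
```

In practice the underlying stream would be wrapped in a `GZIPOutputStream`/`GZIPInputStream`, since the whole file is gzip-compressed as described above.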

Happy Kafka archiving!
6 responses
Have you evaluated Camus (https://github.com/linkedin/camus)? I think I heard it watches for new topics/partitions...
Hi, we did have a look at Camus a while ago; it appeared to be based on the original hadoop-consumer. We didn't move to it at the time because the original hadoop-consumer code we had was a little simpler (albeit substantially less useful without automatic detection of partitions/topics etc. :). A big thing for us in writing Bifrost was to no longer have all the Hadoop dependencies spread throughout our systems; reading/writing SequenceFiles is no fun as a result of those dependencies. Ultimately we didn't want anything Hadoop-related in there, whether file formats or job processing.
What throughput do you get? What are your gating factors: Kafka input, gzip, bandwidth to S3? Also, do you upload to S3 from inside or outside of Amazon?
We can comfortably handle our load (a few hundred messages/second, most messages relatively small, perhaps a few kilobytes) on a single EC2 instance (currently a c3.large). Looking at our metrics dashboard, we currently achieve a maximum write throughput to S3 of around 11MB/s. Of course this is from within EC2 and from a single instance. CPU load is pretty minimal (the machine's current 1-minute load average is 0.26), and that's writing Gzip-compressed records. I seem to remember network being the biggest bottleneck (both consuming from Kafka and writing to S3) when we ran some tests whilst we were building the app. More expensive EC2 instance types have better performance, but we've had no problems getting all data replicated using the smaller types. We also didn't look into having multiple Bifrost instances running, consuming different topic partitions, but I would expect that to improve the CPU/network balance. Sorry, I don't have historical data to hand, but we never have much of an issue with lag and the quantity of data is easily handled by our setup. For reference, our Kafka boxes are on relatively old EC2 hardware that we'll be migrating soon; again, I'd expect to see better networking and disk performance, but our volumes are well within current capacity. Hope that's helpful?
Hi, I am backing up an existing Kafka broker. What kind of payloads are you using? And do you have examples of deserializing them? I see there is a `baldr-seq` available and I get byte[] back for a record (briefly playing with this so far :-) )... Thanks in advance.
UPDATE to my previous question above... I answered my own question. I know I am dropping JSON as a string into the payload... It seems I can easily reconstruct the JSON string from the byte[] as normal...
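For reference, that conversion is just decoding the byte[] as a UTF-8 string and parsing it. A minimal Clojure sketch, assuming the payload was written as UTF-8 and using the cheshire library for JSON parsing (an assumption, not something Bifrost requires):

```clojure
(require '[cheshire.core :as json])

;; record is the byte[] returned for a single Baldr record
(defn record->json [^bytes record]
  (json/parse-string (String. record "UTF-8") true))
```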