We're happy to announce the public release of a tool we've been using in production for a while now: Bifrost.
We use Bifrost to incrementally archive all our Kafka data into Amazon S3; these transaction logs can then be ingested into our streaming data pipeline (we only need to use the archived files occasionally when we radically change our computation).
There are a few other projects that scratch the same itch: notably Secor
from Pinterest and Kafka's old hadoop-consumer
. Although Secor doesn't rely on running Hadoop jobs, it still uses Hadoop's SequenceFile file format; sequence files allow compression and distributed splitting operations, as well as just letting you access the data in a record-oriented way. We'd been using code similar to Kafka's hadoop-consumer for a long time but it was running slowly and didn't do a great job of watching for new topics and partitions.
We wanted something that didn't introduce the cascade of Hadoop dependencies and would be able to run a little more hands-off.
Bifrost is the tool we wanted. It has a handful of configuration options that need to be set (e.g. S3 credentials, Kafka ZooKeeper consumer properties) and will continually monitor Kafka for new topics/partitions and create consumers to archive the data to S3.
Here's an example configuration file:
Bifrost is written in Clojure and can be built using Leiningen
. If you want to try it locally you can just `lein run`, or, for production you can build an uberjar and run:
Data is stored in "directories" in S3: s3://<bucket>/<kafka-consumer-group-id>/<topic>/partition=<partition-id>/0000000.baldr.gz (the filename is the starting offset for that file.
We use our Baldr
file format: the first 8-bytes indicate the length of the record, then the following n-bytes represent the record itself, then another 8-byte record length etc. It provides almost all of what we need from SequenceFiles but without the Hadoop dependencies. We have a Clojure implementation but it should be trivial to write in any language. We also compress the whole file output stream with Gzip to speed up uploads and reduce the amount we store on S3.
Happy Kafka archiving!