At uSwitch.com we’re in the process of migrating a lot of our data infrastructure to be much lighter-weight: pure Clojure and supporting libraries, rather than sitting atop Hadoop and other chunkier frameworks.
A lot of our activity stream data is currently archived in files on Amazon S3, specifically as Hadoop SequenceFiles: record-oriented files that contain key/value pairs suitable for MapReduce-style processing.
Working with SequenceFiles requires a dependency on org.apache.hadoop/hadoop-core and a ton of transitive dependencies. Further, SequenceFiles support both record and block compression with the Deflate, GZip and Snappy codecs, and if you’re compressing the contents with GZip or Snappy you’ll also need the hadoop-native lib, which ranges from a real effort to impossible to build on anything but Linux.
Being able to write arbitrary bytes to a file is really useful for serialization, but we also need message/record boundaries so that we can read those records back one at a time.
We were convinced that a file format like this must already exist but couldn’t find one, so we wrote a small library called Baldr for working with records of bytes.
We’re still convinced this kind of thing must already exist somewhere, or at least have a recognised name. In the meantime, Baldr solves the problem neatly (and with far fewer dependencies).