Processing null character terminated text files in Hadoop with a NullByteTextInputFormat

It was Forward's first hackday last Friday. Tom Hall and I worked on some collaborative filtering using Mahout and Hadoop to process some data collected from Twitter.

I'd been collecting statuses through their streaming API (using some code from a project called Twidoop). After wrestling with formats for a little while we realised that each JSON record was separated by a null character (all bits zeroed). There didn't seem to be an easy way to change the terminating character for the TextInputFormat that we'd normally use so we wrote our own InputFormat and I'm posting it here for future reference (and anyone else looking for a Hadoop InputFormat to process similar files).

We added it to my fork of cascading-clojure but it will work in isolation.

pingles

Blog of Paul Ingles

Processing null character terminated text files in Hadoop with a NullByteTextInputFormat