I'd been collecting statuses through their streaming API (using some code from a project called Twidoop). After wrestling with formats for a little while we realised that each JSON record was separated by a null character (all bits zeroed). There didn't seem to be an easy way to change the terminating character for the
TextInputFormat that we'd normally use so we wrote our own
InputFormat and I'm posting it here for future reference (and anyone else looking for a Hadoop
InputFormat to process similar files).
We added it to my fork of cascading-clojure but it will work in isolation.