pingles

Amazon Redshift + R: Analytics Flow

Ok, so it’s a slightly fanboy-ish title but I’m starting to really like the early experimentation we’ve been doing with Amazon’s Redshift service at uSwitch.

Our current data platform is a mix of Apache Kafka, Apache Hadoop/Hive and a set of heterogeneous data sources spread across the organisation (given we’re fans of letting the right store find its place).

The data we ingest is reasonably sizeable (gigabytes a day); certainly enough to trouble the physical machines uSwitch used to host with. However, for nearly the last 3 years we’ve been breaking uSwitch’s infrastructure and systems apart and it’s now much easier to consume whatever resources you need.

Building data systems on immutable principles also makes this kind of experimentation so much easier. For a couple of weeks we (Paul and I) have been re-working some of our data warehousing ETL to see what a Redshift analytics world looks like.

Of course it’s possible to just connect any JDBC SQL client to Redshift but we want to be able to do some more interactive analysis on the data we have. We want an Analytics REPL.

Redshift in R

I’m certainly still a novice when it comes to both statistical analyses and R but it’s something I’m enjoying, and I’m lucky to work with people who are great at both.

R already has a package for connecting to databases using JDBC, but I built a small R package that bundles both the PostgreSQL 8.4 JDBC driver and a few functions to make Redshift nicer to interact with: Redshift.R. N.B. this was partly so I could learn about writing R packages, and partly to make it trivial for other R users in the company to get access to our experimental cluster.

The package is pretty easy to install: download the tarball, uncompress it and run an R statement. The full instructions are available on the project’s homepage. Once you’ve installed it you’re done; there’s no need to download anything else.

Flow

What I found really interesting, however, was how my workflow changed once data was accessible in Redshift and directly usable from inside my R environment; the 20-minute lead/cycle time for a Hive query was gone and I could work interactively.

I spent about half an hour working through the following example- it’s pretty noddy analytics but shows why I’m starting to get a little excited about Redshift: I can work mostly interactively without needing to break my work into pieces and switch around the whole time.

	require(redshift)

	conn <- redshift.connect("jdbc:postgresql://mycluster.redshift.amazonaws.com:5439/data", "user", "pass")

	# we can retrieve a list of tables
	tables <- redshift.tables(conn)

	# and get some info about the columns in one of those tables
	cols <- redshift.columns(conn, "weblog")

	# lets run a simple query, a number of requests (by their status code) grouped by day
	statuses_by_day <- redshift.query(conn, "SELECT DATE(time_stamp) as dated, status, COUNT(1) as request_count",
	    "FROM weblog",
	    "GROUP BY DATE(time_stamp), status",
	    "ORDER BY dated")
	statuses_by_day$dated <- as.Date(statuses_by_day$dated)
	statuses_by_day$status <- as.factor(statuses_by_day$status)

	# lets do a little scatter plot to see how this looks
	require(ggplot2)
	p <- ggplot(statuses_by_day, aes(x=dated, y=request_count))
	p + geom_point(aes(color=status))

	# lets see if 200 success requests correlate with the number of
	# whatchamajig transactions we make by day
	success.requests.byday <- redshift.query(conn, "SELECT DATE(time_stamp) as dated, COUNT(1) as requests",
	    "FROM weblog",
	    "WHERE status = 200",
	    "GROUP BY DATE(time_stamp)",
	    "ORDER BY dated ASC")

	# next, number of transactions by day too
	txns.byday <- redshift.query(conn, "SELECT DATE(time_stamp) as dated, COUNT(1) as transactions",
	    "FROM whatchamajig_transactions",
	    "GROUP BY DATE(time_stamp)",
	    "ORDER BY dated ASC")

	traffic.txns.byday <- merge(success.requests.byday, txns.byday, by="dated")
	traffic.txns.byday$dated <- as.Date(traffic.txns.byday$dated)

	# quick plot to see whether it looks realistic?
	traffic.txns.p <- ggplot(traffic.txns.byday, aes(x=requests, y=transactions))
	traffic.txns.p + geom_point()

	# lets use cor.test to test and measure significance
	cor.test(traffic.txns.byday$requests, traffic.txns.byday$transactions, alternative="greater")

	# interesting, but we may have influenced the data by only including successful
	# requests, redirects and errors may also be interesting

	# ok, so how about correlation to all requests
	requests.byday <- redshift.query(conn, "SELECT DATE(time_stamp) as dated, COUNT(1) as requests",
	    "FROM weblog",
	    "GROUP BY DATE(time_stamp)",
	    "ORDER BY dated ASC")

	all.traffic.txns <- merge(requests.byday, txns.byday, by="dated")
	all.traffic.txns$dated <- as.Date(all.traffic.txns$dated)

	all.traffic.txns.p <- ggplot(all.traffic.txns, aes(x=requests, y=transactions))
	all.traffic.txns.p + geom_point()

	cor.test(all.traffic.txns$requests, all.traffic.txns$transactions, alternative="greater")

	# what about traffic to _only_ a specific section... is that stronger?
	whatchamajig.requests.byday <- redshift.query(conn, "SELECT DATE(time_stamp) as dated, COUNT(1) as requests",
	    "FROM weblog",
	    "WHERE request_path LIKE '%whatchamajig%'",
	    "GROUP BY DATE(time_stamp)",
	    "ORDER BY dated ASC")
	whatchamajig.requests.txns <- merge(whatchamajig.requests.byday, txns.byday, by="dated")
	whatchamajig.requests.txns$dated <- as.Date(whatchamajig.requests.txns$dated)

	# lets look at the plot
	whatchamajig.requests.txns.p <- ggplot(whatchamajig.requests.txns, aes(x=requests, y=transactions))
	whatchamajig.requests.txns.p + geom_point()

	# and one last final correlation test :)
	cor.test(whatchamajig.requests.txns$requests, whatchamajig.requests.txns$transactions, alternative="greater")

Disclosure

It would be remiss of me not to mention that R already has packages for connecting to Hadoop and Hive, and work to provide faster querying through tools like Cloudera’s Impala. My epiphany is probably also very old news to those already familiar with connecting to Vertica or Teradata warehouses with ODBC and R.

The killer thing for me is that it cost us probably a few hundred dollars to create a cluster with production data in, kick the tyres, and realise there’s a much better analytics cycle for us out there. We're really excited to see where this goes.

Kafka for uSwitch's Event Pipeline

Kafka is a high-throughput, persistent, distributed messaging system that was originally developed at LinkedIn. It forms the backbone of uSwitch.com’s new data analytics pipeline and this post will cover a little about Kafka and how we’re using it.

Kafka is both performant and durable. To make it easier to achieve high throughput on a single node it also does away with lots of stuff message brokers ordinarily provide (making it a simpler distributed messaging system).

Messaging

Over the past 2 years we’ve migrated from a monolithic environment based around Microsoft .NET and SQL Server to a mix of databases, applications and services. These change over time: applications and servers will come and go.

This diversity is great for productivity but has made data analytics as a whole more difficult.

We use Kafka to make it easier for the assortment of micro-applications and services that together form uSwitch.com to exchange and publish data.

Messaging helps us decouple the parts of the infrastructure letting consumers and producers evolve and grow over time with less centralised coordination or control; I’ve referred to this as building a Data Ecosystem before.

Kafka lets us consume data in realtime (so we can build reactive tools and products) and provides a unified way of getting data into long-term storage (HDFS).

Consumers and producers

Kafka’s model is pretty general; messages are published onto topics by producers, stored on disk and made available to consumers. It’s important to note that messages are pulled by consumers to avoid needing any complex throttling in the event of slow consumption.
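
To make the pull model a little more concrete, here is a minimal sketch in Python. It doesn't talk to a real broker: `Topic`, `publish` and `fetch` are hypothetical names made up for illustration and are not Kafka's API. The point is only that each consumer decides when and how much to read, so a slow consumer falls behind without anyone needing to throttle it.

```python
class Topic:
    """An in-memory stand-in for a topic: an ordered list of byte payloads."""

    def __init__(self):
        self._messages = []

    def publish(self, payload: bytes) -> int:
        """Append a payload and return the position it was stored at."""
        self._messages.append(payload)
        return len(self._messages) - 1

    def fetch(self, position: int, max_messages: int) -> list:
        """Return up to max_messages payloads starting at position."""
        return self._messages[position:position + max_messages]


topic = Topic()
for i in range(10):
    topic.publish(f"event-{i}".encode())

# A fast consumer pulls large batches, a slow one pulls small batches.
# Neither affects the other, and the topic never pushes anything.
fast_position, slow_position = 0, 0

fast_batch = topic.fetch(fast_position, max_messages=8)
fast_position += len(fast_batch)

slow_batch = topic.fetch(slow_position, max_messages=2)
slow_position += len(slow_batch)
```

Both consumers read the same stored messages; the only thing that differs is the position each one has reached.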

Kafka doesn’t dictate any serialisation; it just expects a payload of byte[]. We’re using Protocol Buffers for most of our topics to make it easier to evolve schemas over time. Having a repository of definitions has also made it slightly easier for teams to see what events they can publish and what they can consume.

This is what it looks like in Clojure code using clj-kafka.

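	;; with-resource is assumed to live in clj-kafka.core for this version of clj-kafka
	(use 'clj-kafka.core)
	(use 'clj-kafka.producer)
	(use 'clj-kafka.consumer.zk)

	(def config {"zk.connect" "localhost:2181"
	             "groupid"    "my-task-group"})

	;; producing
	(def p (producer config))

	(send-messages p "test" (message (.getBytes "payload")))


	;; consumin'
	;; the consumer is shut down once the body has been evaluated
	(with-resource [c (consumer config)]
	  shutdown
	  (take 5 (messages c "test")))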

We use messages to record the products that are shown across our site, the searches that people perform, emails that are sent (and bounced), web requests and more. In total it’s probably a few million messages a day.

Metadata and State

Kafka uses Zookeeper for various bits of meta-information, including tracking which messages have already been retrieved by a consumer. To that end, it is the consumer’s responsibility to track consumption, not the broker’s. Kafka’s client library already contains a Zookeeper consumer that will track the message offsets that have been consumed.

As an aside, the broker keeps no state about any of the consumers directly. This keeps it simple and means there’s no need for complex structures kept in memory, reducing the need for garbage collections.
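
A rough sketch of what consumer-side tracking looks like, again in plain Python. `OffsetStore` is a hypothetical stand-in for Zookeeper (just a dictionary here) and `consume` is an illustrative helper, not anything from Kafka's client library. The broker side is nothing more than a list: it knows nothing about who has read what.

```python
class OffsetStore:
    """Stand-in for Zookeeper: remembers the next offset per (group, topic)."""

    def __init__(self):
        self._offsets = {}

    def load(self, group, topic):
        # A group that has never consumed starts from the beginning.
        return self._offsets.get((group, topic), 0)

    def commit(self, group, topic, offset):
        self._offsets[(group, topic)] = offset


def consume(log, store, group, topic, limit):
    """Read up to `limit` messages, then record how far this group has got."""
    start = store.load(group, topic)
    batch = log[start:start + limit]
    store.commit(group, topic, start + len(batch))
    return batch


log = [b"a", b"b", b"c", b"d", b"e"]  # what the broker holds for one topic
store = OffsetStore()

first = consume(log, store, "my-task-group", "test", limit=2)
# A restarted consumer in the same group resumes where the last one stopped.
second = consume(log, store, "my-task-group", "test", limit=2)
# A different group has its own position and starts from the beginning.
other = consume(log, store, "another-group", "test", limit=3)
```

Because the position lives with the consumer group rather than the broker, adding another group costs the broker nothing.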

When messages are received they are written to a log file (well, handed off to the OS to write) named after the topic; these are serial append files so individual writes don’t need to block or interfere with each other.

When reading messages consumers simply access the file and read data from it. It’s possible to perform parallel consumption through partitioned topics although this isn’t something we’ve needed yet.
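
The append-and-read-sequentially idea can be sketched with an ordinary file. This is only an illustration: the 4-byte length prefix and the use of byte positions as offsets are assumptions made for the example, not a description of Kafka's actual on-disk format, and `append` and `read_from` are hypothetical helpers.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">I")  # 4-byte big-endian length prefix (assumed format)


def append(path, payload):
    """Append one length-prefixed message; return the byte offset it starts at."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(HEADER.pack(len(payload)) + payload)
    return offset


def read_from(path, offset):
    """Read sequentially from `offset`, yielding (next_offset, payload) pairs."""
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                return  # reached the end of the log
            (length,) = HEADER.unpack(header)
            payload = f.read(length)
            yield f.tell(), payload


# One log file per topic, named after the topic.
log_path = os.path.join(tempfile.mkdtemp(), "test.log")
offsets = [append(log_path, m) for m in (b"first", b"second", b"third")]

everything = [payload for _, payload in read_from(log_path, 0)]
from_second = [payload for _, payload in read_from(log_path, offsets[1])]
```

Writers only ever add to the end of the file, and a reader needs nothing more than a starting position to pick up from.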

Topic and message storage

Messages are tracked by their offset, letting consumers read from a given point in the topic. A consumer can connect and ask for all messages that Kafka has stored currently, or from a specified offset. This relatively long retention (compared to other messaging systems) makes Kafka extremely useful for supporting both real-time and batch reads. Further, because it takes advantage of disk throughput, it’s a cost-effective system too.

The broker can be configured to keep messages up to a specified quantity or for a set period of time. Our broker is configured to keep messages for up to 20 days; after that you’ll need to go elsewhere (most topics are stored on HDFS afterwards). This characteristic is what has made it so useful for us: it makes getting data out of applications and servers and into other systems much easier, and more reliable, than periodically aggregating log files.
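
Time-based retention is easy to picture with a small sketch. The segment dictionaries, the `expire` function and the example date are all made up for illustration; the real broker is driven by configuration rather than code like this. Only the 20-day window comes from our setup.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=20)  # the retention period mentioned above


def expire(segments, now, retention=RETENTION):
    """Keep only segments whose last write falls inside the retention window."""
    cutoff = now - retention
    return [s for s in segments if s["last_written"] >= cutoff]


now = datetime(2013, 3, 1)  # arbitrary date for the example
segments = [
    {"name": "segment-a", "last_written": now - timedelta(days=30)},
    {"name": "segment-b", "last_written": now - timedelta(days=20)},
    {"name": "segment-c", "last_written": now - timedelta(days=1)},
]

kept = [s["name"] for s in expire(segments, now)]
```

Anything older than the window is simply dropped, which is why the HDFS copy matters for longer-term analysis.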

Performance

Kafka’s performance (and the design that achieves it) is derived from the observation that disk throughput has outpaced latency; it writes and reads sequentially and uses the operating system’s file system caches rather than trying to maintain its own, minimising the JVM working set and, again, avoiding garbage collections.

The plot below shows results published within an ACM article; their experiment was to measure how quickly they could read 4-byte values sequentially and randomly from different storage.

[Figure: read performance for sequential and random access across memory, SSD and spinning disk]

Please note the scale is logarithmic because the difference between random and sequential is so large for both SSD and spinning disks.

Interestingly, it shows that sequential disk access, spinning or SSD, is faster than random memory access. It also shows that, in their tests, sequential spinning disk performance was higher than SSD.

In short, using sequential reads lets Kafka get performance close to random memory access. And, by keeping very little in the way of metadata, the broker can be extremely lightweight.

If you’re interested, the Kafka design document is very readable and accessible.

Batch Load into HDFS

As I mentioned earlier, most topics are stored on HDFS so that we can maximise the amount of analysis we can perform over time.

We use a Hadoop job that is derived from the code included within the Kafka distribution.

The process looks a little like this:

[Figure: Hadoop loading process]

Each topic has a directory on HDFS that contains 2 further subtrees: these contain offset token files and data files. The input to the Hadoop job is an offset token file which contains the details of the broker to consume from, the message offset to read from, and the name of the topic. Although it’s a SequenceFile the value bytes contain a string that looks like this:

broker.host.com topic-name  102991

The job uses a RecordReader that connects to the Kafka broker and passes the message payload directly through to the mapper. Most of the time the mapper just writes the message bytes straight out. The output is written using Hadoop’s SequenceFileOutputFormat (so we can compress and split the data for higher-volume topics) and Hadoop’s MultipleOutputs, so we can write out 2 files: the data file and a newly updated offset token file.

For example, if we run the job and consume from offset 102991 to offset 918280, this will be written to the offset token file:

broker.host.com topic-name  918280

Note that the contents of the file are exactly the same as before, just with the offset updated. All the state necessary to perform incremental loads is managed by the offset token files.

This ensures that the next time the job runs we can incrementally load only the new messages. If we introduce a bug into the Hadoop load job we can just delete one or more of the token files to cause the job to load from further back in time.
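
The token handling is simple enough to sketch outside Hadoop. This is not our actual job: `parse_token`, `format_token`, `run_load` and `fake_fetch` are hypothetical names, the broker is faked, and the single-space separator written back out is an assumption (parsing accepts any whitespace).

```python
from typing import NamedTuple


class OffsetToken(NamedTuple):
    broker: str
    topic: str
    offset: int


def parse_token(line):
    """Parse 'broker topic offset', separated by any whitespace."""
    broker, topic, offset = line.split()
    return OffsetToken(broker, topic, int(offset))


def format_token(token):
    return f"{token.broker} {token.topic} {token.offset}"


def run_load(token_line, fetch):
    """One job run: read from the token's offset, emit data plus a new token."""
    token = parse_token(token_line)
    messages, end_offset = fetch(token.broker, token.topic, token.offset)
    return messages, format_token(token._replace(offset=end_offset))


def fake_fetch(broker, topic, offset):
    """Pretend broker whose topic currently ends at offset 918280."""
    end = 918280
    if offset < end:
        return [f"messages from {offset} to {end}".encode()], end
    return [], end


data, new_token = run_load("broker.host.com topic-name  102991", fake_fetch)
# Running again with the updated token finds nothing new to load.
data_again, same_token = run_load(new_token, fake_fetch)
```

Deleting or rewinding a token is all it takes to replay a range, because the token is the only state the load depends on.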

Again, Kafka’s inherent persistence makes dealing with these kinds of HDFS loads much easier than dealing with polling for logs. Previously we’d used other databases to store metadata about the daily rotated logs we’d pulled but there was lots of additional computation in splitting apart files that would span days. Incremental loads with Kafka are far cleaner and more efficient.

Kafka has both helped us simplify our data collection infrastructure, letting us evolve and grow it more flexibly, and provided the basis for building real-time systems. It’s extremely simple and very easy to set up and configure; I’d highly recommend it for anyone playing in a similar space.

Related Stuff

As I publish this LinkedIn have just announced the release of Camus: their Kafka to HDFS pipeline. The pipeline I’ve described above was inspired by the early Hadoop support within Kafka but has since evolved into something specific for use at uSwitch.

Twitter also just published about their use of Kafka and Storm to provide real-time search.

I can also recommend reading “The Unified Logging Infrastructure for Data Analytics at Twitter” paper that was published late last year.

Finally, this post was based on a brief presentation I gave internally in May last year: Kafka a Little Introduction

Blog of Paul Ingles

Amazon Redshift + R: Analytics Flow

Kafka for uSwitch's Event Pipeline