Amazon Redshift + R: Analytics Flow

Ok, so it’s a slightly fanboy-ish title but I’m starting to really like the early experimentation we’ve been doing with Amazon’s Redshift service at uSwitch.

Our current data platform is a mix of Apache Kafka, Apache Hadoop/Hive and a set of heterogenous data sources mixed across the organisation (given we’re fans of letting the right store find it’s place).

The data we ingest is reasonably sizeable (gigabytes a day); certainly enough to trouble the physical machines uSwitch used to host with. However, for nearly the last 3 years we’ve been breaking uSwitch’s infrastructure and systems apart and it’s now much easier to consume whatever resources you need.

Building data systems on immutable principles also makes this kind of experimentation so much easier. For a couple of weeks we (Paul and I) have been re-working some of our data warehousing ETL to see what a Redshift analytics world looks like.

Of course it’s possible to just connect any JDBC SQL client to Redshift but we want to be able to do some more interactive analysis on the data we have. We want an Analytics REPL.

Redshift in R

I’m certainly still a novice when it comes to both statistical analyses and R but it’s something I’m enjoying- and I’m lucky to work with people who are great at both.

R already has a package for connecting to databases using JDBC but I built a small R package that includes both the Postgresql 8.4 JDBC driver and a few functions to make it nicer to interact with: Redshift.R. N.B. this was partly so I could learn about writing R packages, and partly about making it trivial for other R users in the company to get access to our experimental cluster.

The package is pretty easy to install- download the tarball, uncompress and run an R statement. The full instructions are available on the project’s homepage. Once you’ve installed it you’re done- no need to download anything else.

Flow

What I found really interesting, however, was how I found my workflow once data was accessible in Redshift and directly usable from inside my R environment; the 20 minute lead/cycle time for a Hive query was gone and I could work interactively.

I spent about half an hour working through the following example- it’s pretty noddy analytics but shows why I’m starting to get a little excited about Redshift: I can work mostly interactively without needing to break my work into pieces and switch around the whole time.

Disclosure

It would be remiss of me not to mention that R already has packages for connecting to Hadoop and Hive, and work to provide faster querying through tools like Cloudera’s Impala. My epiphany is probably also very old news to those already familiar with connecting to Vertica or Teradata warehouses with ODBC and R.

The killer thing for me is that it cost us probably a few hundred dollars to create a cluster with production data in, kick the tyres, and realise there’s a much better analytics cycle for us out there. We're really excited to see where this goes.

38 responses

This is exactly what I needed! Thanks for doing the work!

— Tiffany Washburn

A posthaven user upvoted this post.

Hi, Excellent Article! I, however am facing one specific issue. Using redshift.query, I was able to fetch (some part) of my otherwise huge dataset from the Amazon Cluster. Is there any way by which I can speed up the data pulls from the cluster into R, may be like using a function ? Alternatively, I tried using fread() from data.table package, along with redshift functions to speed load data but it didn't work. Can you please provide a work around ? Eagerly awaiting your reply. Thanks & Regards YB

— Yashodhan Bhatt

35 visitors upvoted this post.

pingles

Blog of Paul Ingles