uSwitch.com uses Riemann to monitor the operations of the various applications and services that compose our system. Riemann helps us aggregate systems and application metrics together and quickly assemble dashboards to understand how things are behaving now.
Most of these services and applications are written in Clojure, Java or Ruby, but recently we needed to monitor a service we implemented with Go.
Hopefully this is interesting for Go authors that haven’t come across Riemann or Clojure programmers who’ve not considered deploying Go before. I think Go is a very interesting option for building systems-type services.
Recently at uSwitch the team I work on started replacing a small but core service from a combination of Java and Clojure programs to Go.
We’ve done this for probably 3 main reasons:
- Deploying a statically-linked binary with few other runtime dependencies (such as a JVM) makes deployment simpler and the process overhead smaller.
- Go’s systems roots makes it feel like your programming closer to the OS. Handling socket errors, process signals etc. They all feel better than writing through the Java (or other) abstractions.
- It also gives us the chance to experiment with something different in a small, contained way. This is the way I/we have been building systems for the last few years at uSwitch and for 7 or 8 years prior at TrafficBroker and Forward: we help systems emerge and evolve from smaller pieces. It provides a way to safely, and rapidly, learn about the best way to apply tools and languages in different situations through experimentation.
Riemann is a wonderful tool for monitoring the performance and operational behaviour of a system.
We’ve gone from using it to just aggregate health checks and alerting to running live in-production experiments to understand the best value EC2 hardware to run the system on. It’s proved both versatile and valuable.
For some of the services we’ve built Riemann is the primary way to see what the application is doing. We no longer build application-specific dashboards or admin interfaces. For example, our open-source systems Blueshift and Bifrost do this.
Getting data into Riemann
Riemann is really just an event stream processor: it receives incoming events and applies functions to produce statistics and alerts.
Although it’s possible for applications to write events directly- Riemann uses Protocol Buffers and TCP/UDP directly so its relatively easy to build clients. We’ve found it far easier, however, to use Metrics- a Java library that provides some slightly higher-level abstractions: Timers, Counters and more making it much cleaner and easier to integrate.
Although the original Metrics library was written for Java, an equivalent has been written in Go: go-metrics. We’ve contributed a Riemann reporter which is currently sitting inside a pull request, until then you can also use my fork with Riemann support.
Go Example reporting metrics to Riemann
Let’s consider an example to help understand how it can be helpful.
We’re building a service which will listen for TCP connections, perform some operation and then close the connection. One of the most obvious useful things we can monitor is the number of open socket connections we have. This might be useful for monitoring potential resource exhaustion, performance degradation or, just to see how many clients you have connected.
Our example service will: accept a connection, say hello, sleep for 5 seconds, close the connection. We’ll use a Counter metric that we’ll increment and decrement when clients open and close connections.
connCounter and register it in the
metrics.DefaultRegistry before getting on with our work. As our program runs the registry stores the metric values. After registering our metric we start a goroutine that will periodically report all metric values to a Riemann instance. Note that our program imports my fork of go-metrics.
handleConnection function is responsible for servicing connected clients. It first increments the
connCounter and then defers a decrement operation so we’ll decrement when our function returns.
This is a pretty trivial example, most of our services expose multiple different metrics recording things like counters for the number of open files, timers for the speed we can service requests (which will automatically calculate summary statistics), meters for whenever we experience an error etc.
Running the example
Although Riemann supports many different outputs we frequently use riemann-dash which lets you easily create (and save) custom dashboards with graphs, logs, gauges and more.
To help see what our example service does when its run I’ve recorded a small video that shows us running the service and a
riemann-dash plot showing the number of open connections. After we’ve started the service we connect a couple of clients with telnet, the plot moves as connections are opened (and closed).