The multi-armed (also often referred to as K-armed) bandit problem models the problem a gambler faces when attempting maximise the reward from playing multiple machines with varying rewards.
For example, let’s assume you are standing in front of 3 arms and that you don’t know the rate at which they will reward you. How do you set about the task of pulling the arms to maximise your cumulative reward?
It turns out there’s a fair bit of literature on this topic, and it’s also the subject of a recent O’Reilly book: “Bandit Algorithms for Website Optimization” by John Myles White (who also co-wrote the excellent Machine Learning for Hackers).
This article discusses my implementation of the algorithms which is available on GitHub and Clojars: https://github.com/pingles/clj-bandit.
I was happy to attend this year’s clojure-conj and started reading through the PDF whilst on the flight out. Over the next few evenings, afternoons and mornings (whenever I could squeeze in time) I spent some time hacking away at implementing the algorithms in Clojure. It was great fun and I pushed the results into a library: clj-bandit.
I was initially keen on implementing the same algorithms and being able to reproduce the results shown in the book. Since then I’ve spent a little time tweaking parts of the code to be a bit more functional/idiomatic. The bulk of this post covers this transition.
I started with a structure that looked like this:
From there I started layering on functions that select the arm with each algorithm’s
select-arm implemented in its own namespace.
One of the simplest algorithms is Epsilon-Greedy: it applies a fixed probability when deciding whether to explore (try other arms) or exploit (pull the currently highest-rewarding arm).
The code, as implemented in clj-bandit, looks like this:
We generate a random number (between 0 and 1) and either pick the best performing or a random item from the arms.
In my initial implementation I kept algorithm functions together in a protocol, and used another protocol for storing/retrieving arm data. These were reified into an ‘algorithm’:
select-arm to the current arm state would select the next arm to pull. Having pulled the arm,
update-reward would let the ‘player’ track whether they were rewarded or not.
This worked, but it looked a little kludgey and made the corresponding monte-carlo simulation code equivalently disgusting.
I initially wanted to implement all the algorithms so I could reproduce the same results that were included in the book but the resulting code definitely didn’t feel right.
After returning from the conference I started looking at the code and started moving a few functions around. I dropped the protocols and went back to the original datastructure to hold the algorithm’s current view of the world.
I decided to change my approach slightly and introduced a record to hold data about the arm.
The algorithms don’t need to know about the identity of any individual arm- they just need to pick one from the set. It tidied a lot of the code in the algorithms. For example, here’s the
select-arm code from the UCB algorithm:
The cool part about the book and it’s accompanying code is that includes a simulation suitable for measuring the performance and behaviour of each of the algorithms.
This is important because the bandit algorithms have a complex feedback cycle: their behaviour is constantly changing in the light of data given to them during their lifetime.
For example, following from the code by John Myles White in his book, we can visualise the algorithm’s performance over time. One measure is accuracy (that is, how likely is the algorithm to pick the highest paying arm at a given iteration) and we can see the performance across algorithms over time, and according to their exploration/exploitation parameters, in the plot below:
The simulation works by using a series of simulated bandit arms. These will reward randomly according to a specified probability:
We can model the problem neatly by creating a function representing the arm, we can then pull the arm by applying the function.
As I mentioned earlier, when the code included protocols for algorithms and storage, the simulation code ended up being pretty messy. After I’d dropped those everything felt a little cleaner and more Clojure-y. This felt more apparent when it came to rewriting the simulation harness.
clj-bandit.simulate has all the code, but the key part was the introduction of 2 functions that performed the simulation:
simulation-seq creates a sequence through
simulate is passed a map that contains the current state of the algorithm (the performance of the arms), it returns the updated state (based on the pull during that iteration) and tracks the cumulative reward. Given we’re most interested in the performance of the algorithm we can then just
(map :result ...) across the sequence. Much nicer than nested
At uSwitch we’re interested in experimenting with multi-armed bandit algorithms. We can use simulation to estimate performance using already observed data. But, we’d also need to do a little work to consume these algorithms into web applications.
There are existing Clojure libraries for embedding optimisations into your Ring application:
These both provide implementations of the algorithms and for storing their state in stores like Redis.
I chose to work through the book because I was interested in learning more about the algorithms but I also like the idea of keeping the algorithms and application concerns separate.
Because of that I’m keen to work on a separate library that makes consuming the algorithms from clj-bandit into a Ring application easier. I’m hoping that over the coming holiday season I’ll get a chance to spend a few more hours working on it.