Significance Irrelevant in Online Experiments

Earlier this week I heard about Google Analytics’ new Content Experiments feature. The help documentation also includes some interesting information about the engine used to run the experiments.

I did some further reading from the references and came across A Modern Bayesian Look at the Multi-armed Bandit by S. L. Scott.

The author of the paper makes a very interesting argument as to why multi-armed bandits suit online experiments better than classical null-hypothesis experiment methods: classical experiments control for the wrong type of error.

In a classical experiment we would randomise people into two groups, expose the two groups to different treatments, and finally collect data to measure the effect. We then use statistics to determine whether we have statistically significant evidence to reject the null hypothesis.
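As a rough illustration of that last step, here is a minimal sketch of a classical two-proportion significance test, assuming the outcomes are summarised as conversion counts per group. The numbers and the 5% threshold are invented for the example and are not taken from the paper.

```python
# A minimal sketch of the classical approach: a two-sided z-test for a
# difference in conversion rates. All figures below are hypothetical.
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, p_value

# Hypothetical data: 120/2000 conversions on A, 150/2000 on B.
z, p = two_proportion_z_test(120, 2000, 150, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")                      # reject H0 only if p < 0.05
```

With these made-up numbers the p-value comes out just above 0.05, so the classical procedure would decline to switch even though B looks better.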

Scott argues that such experiments “are designed to be analyzed using methods that tightly control the type-I error rate under a null hypothesis of no effect”.

Type-I vs. Type-II Error

A Type-I error is the incorrect rejection of a true null hypothesis. That is, picking a treatment that isn’t materially better. In contrast, a Type-II error is “failing to reject a false null hypothesis”. That is, failing to switch to a better treatment.

Scott argues that in experiments with low switching costs, such as in online optimisation, type-II errors are more costly than type-I errors. In other words, it’s far worse for us to fail to try a better treatment than it is to incorrectly pick a treatment that isn’t materially better.
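To make the contrast concrete, here is a minimal Thompson-sampling sketch of a two-armed bandit with Bernoulli rewards and Beta(1, 1) priors, in the spirit of the approach Scott describes. The conversion rates are invented for illustration, and this is not Google’s actual implementation.

```python
# A minimal Thompson-sampling sketch: two arms, Bernoulli rewards,
# Beta(1, 1) priors. True conversion rates below are hypothetical.
import random

true_rates = {"A": 0.06, "B": 0.075}           # hypothetical true conversion rates
alpha = {arm: 1 for arm in true_rates}         # Beta posterior: 1 + successes
beta = {arm: 1 for arm in true_rates}          # Beta posterior: 1 + failures

for _ in range(10_000):
    # Sample a plausible rate for each arm from its posterior,
    # then show the arm whose sampled rate is highest.
    samples = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in true_rates}
    arm = max(samples, key=samples.get)
    converted = random.random() < true_rates[arm]
    alpha[arm] += converted
    beta[arm] += not converted

for arm in true_rates:
    shown = alpha[arm] + beta[arm] - 2
    est = alpha[arm] / (alpha[arm] + beta[arm])
    print(f"{arm}: shown {shown} times, estimated rate {est:.3f}")
```

Because each arm is shown in proportion to the posterior probability that it is best, traffic drifts towards the better treatment as evidence accumulates, so a genuinely better treatment is rarely left untried — exactly the type-II error the bandit is guarding against.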

In short, Scott states that, for online experiments, “the usual notion of statistical significance [is] largely irrelevant”.