- Frequentist statistics
- Bayesian statistics
- Revenue and Other Non-Binomial Metrics
Frequentist vs Bayesian
No, we are not going to debate which one you should prefer. There are plenty of articles and threads on the pros and cons of each. Long story short: there is no right or wrong here - go with whatever your A/B testing tool of choice is using and get to know how it works. In the broadest sense, A/B testing is about detecting which variation of your website has the highest probability of performing best based on set criteria (it's actually a bit more complicated, but more on that later in the post). We can distinguish four types of probability:
- Long-term frequencies
- Physical tendencies/propensities
- Degrees of belief
- Degrees of logical support
Frequentist approach
Tools using frequentist-type statistics:
- Optimizely (leveraging some Bayesian wisdom)
- Convert
Null hypothesis
The default position: the conversion rate for the control is equal to the conversion rate for a variation, i.e. there is no significant difference. Usually denoted H0.
Alternative hypothesis
An alternative hypothesis is the statement being tested against the null hypothesis. Often denoted H1 or Ha. In A/B testing, the alternative hypothesis generally claims that a certain variation is performing significantly better than the control.
P-value
One of the most commonly misunderstood concepts in A/B testing and in statistics in general. People new to A/B testing tend to believe it describes the probability of a variation being a winner or a loser compared to the control. That, of course, is not true. The p-value describes the probability of observing the observed (or a greater) difference between the control and the variation under the assumption that the null hypothesis is true, i.e. that there is actually no difference between the groups. Another way to put this is that the p-value describes the probability of seeing the observed difference by pure chance, say, in an A/A test. The p-value is one of the most important metrics in Frequentist statistics, but it is not about detecting the probability of your null or alternative hypothesis being true or false, nor the probability of your variation being better than the control. It is about rejecting or not rejecting the null hypothesis. Here's how it goes (a rough Python sketch of the procedure follows the steps below):
- You come up with some reasonable threshold for rejecting the null hypothesis. The notation used for this threshold is α (the Greek letter alpha). This threshold is a real number between 0 and 1 (usually very close to 0).
- You promise to yourself in advance that you will reject the null hypothesis if the calculated p-value happens to be below α (and not reject it otherwise).
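To make this concrete, here is a minimal sketch of that decision rule for a simple two-proportion z-test; the conversion counts, visitor numbers, and the α = 0.05 threshold are made-up example values, not a recommendation for any particular tool.

```python
# Minimal sketch: p-value for a two-proportion z-test, compared against a
# pre-specified alpha. All numbers below are made up for illustration.
import numpy as np
from scipy.stats import norm

alpha = 0.05                        # rejection threshold, chosen in advance

conv_a, n_a = 480, 10_000           # control: conversions, visitors
conv_b, n_b = 530, 10_000           # variation: conversions, visitors

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                   # pooled rate under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error under H0
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))       # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the observed difference is statistically significant.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")
```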
Type I error
False positive - Rejecting a true null hypothesis. In A/B testing this would mean calling your variation a winner when it is actually worse than, equal to, or at least not as good as your test showed. So the p-value, along with the prespecified α, directly controls the type I (false positive) error rate.

Rejecting a null hypothesis (calling your variation a winner) is a result that triggers an action, most commonly a change on your website (implementing your variation). Therefore, you must think carefully about the percentage of such false decisions you can live with!
Type II error
False negative - Not rejecting a false null hypothesis. In A/B testing it usually describes a situation where your variation is a winner, but your test shows it is not significantly better than the control.
Statistical power
Simply put, statistical power is the probability that you will reject the null hypothesis if it is false. The false negative rate depends on these 3 factors:
- The size of the actual difference between the groups (which, by definition, is nonzero when the null hypothesis is false)
- The variance of the data with which you’re testing the null hypothesis
- The number of data points (your sample size)

Source: An Intuitive Explanation Of P-Values

In the real world, you only have control over the last factor (your sample size), so you can see why controlling the type II error is much trickier. A rough sample-size calculation is sketched below.
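To give at least one concrete sample-size calculation, here is a minimal sketch using statsmodels' power solver; the 5% baseline rate, the 6% target rate, 80% power, and α = 0.05 are assumed example values.

```python
# Minimal sketch: visitors needed per variant to detect a lift from a 5% to
# a 6% conversion rate with 80% power at alpha = 0.05 (illustrative values).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.06, 0.05)   # standardized effect size (Cohen's h)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,              # type I error rate
    power=0.80,              # 1 - type II error rate
    ratio=1.0,               # equal traffic split between control and variation
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} visitors needed per variant")
```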
Confidence intervals
Confidence intervals are the frequentist way of doing parameter estimation. The technical details behind calculating and interpreting confidence intervals are beyond the scope of this post, but I’m going to give you the general overview. Once you’ve calculated a confidence interval using a 95% confidence level, it’s incorrect to say that it covers the true mean with a probability of 95% (this is a common misinterpretation). You can only say in advance that, in the long run, 95% of the confidence intervals you generate by following the same procedure will cover the true mean. In A/B testing, let's say you've calculated a 95% confidence interval for the conversion rate of a given variation, e.g. 4.5% - 5.8%. This does not mean the true conversion rate lies between 4.5% and 5.8% with 95% probability; it means that if you repeated the same procedure over and over, 95% of the intervals produced this way would contain the true conversion rate.
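For reference, a minimal sketch of calculating such an interval with statsmodels' Wilson method; the conversion and visitor counts are placeholders.

```python
# Minimal sketch: 95% confidence interval for a variation's conversion rate.
from statsmodels.stats.proportion import proportion_confint

conversions, visitors = 520, 10_000          # placeholder counts
low, high = proportion_confint(conversions, visitors, alpha=0.05, method="wilson")
print(f"95% CI for the conversion rate: {low:.2%} - {high:.2%}")
```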
Two-tailed test
If you are using a significance level of 0.05, a two-tailed test allots half of your alpha to testing the statistical significance in one direction and half of your alpha to testing statistical significance in the other direction. This means that .025 is in each tail of the distribution of your test statistic. When using a two-tailed test, regardless of the direction of the relationship you hypothesize, you are testing for the possibility of the relationship in both directions. A two-tailed test will test both whether the mean is significantly greater than x and whether the mean is significantly less than x. The mean is considered significantly different from x if the test statistic is in the top 2.5% or bottom 2.5% of its probability distribution, resulting in a p-value less than 0.05.
One-tailed test
If you are using a significance level of .05, a one-tailed test allots all of your alpha to testing the statistical significance in the one direction of interest. This means that .05 is in one tail of the distribution of your test statistic. When using a one-tailed test, you are testing for the possibility of the relationship in one direction and completely disregarding the possibility of a relationship in the other direction. A one-tailed test will test either if the mean is significantly greater than x OR if the mean is significantly less than x, but not both. Then, depending on the chosen tail, the mean is significantly greater than or less than x if the test statistic is in the top 5% of its probability distribution or bottom 5% of its probability distribution, resulting in a p-value less than 0.05. The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction.
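The difference is easy to see numerically. Below is a minimal sketch that runs statsmodels' two-proportion z-test on the same made-up counts, once as a two-tailed test and once as a one-tailed test in the direction of the observed lift.

```python
# Minimal sketch: one-tailed vs two-tailed p-values for the same data.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

counts = np.array([530, 480])      # conversions: variation, control (made up)
nobs = np.array([10_000, 10_000])  # visitors per group

# Two-tailed: is the variation different from the control in either direction?
_, p_two = proportions_ztest(counts, nobs, alternative="two-sided")

# One-tailed: is the variation's rate larger than the control's?
# ("larger" tests whether the first proportion exceeds the second)
_, p_one = proportions_ztest(counts, nobs, alternative="larger")

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
# For a positive observed lift, the one-tailed p-value is half the two-tailed one.
```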
In frequentist A/B testing, we use p-values to choose between two hypotheses: the null hypothesis - that there is no difference between variants A and B - and the alternative hypothesis - that variant B is different. A p-value measures the probability of observing a difference between the two variants at least as extreme as what we actually observed, given that there is no difference between the variants. Once the p-value achieves statistical significance or we’ve seen enough data, the experiment is over. (Source)
Bayesian approach
Tools using Bayesian-type statistics:
- Google Optimize
- VWO
- Adobe Target
- AB Tasty
- Dynamic Yield
Commonly cited advantages of the Bayesian approach:
- Bayesian gets reliable results faster (with a smaller sample)
- Bayesian results are easier to understand for people without a background in statistics (Frequentist results are often misinterpreted)
- Bayesian is better at detecting small changes (Frequentist tests favor the null hypothesis).
Important variables of Bayesian testing:
- α - the underlying and unobserved true metric for variant A
- β - the underlying and unobserved true metric for variant B
- ε - the threshold of expected loss for one of the variants, under which we stop the experiment

If we choose variant A when α is less than β, our loss is β - α; if α is greater than β, we lose nothing. In other words, our loss is the amount by which our metric decreases when we choose that variant. The stopping condition based on ε considers both the likelihood that β - α is greater than zero and the magnitude of this difference. Consequently, it has two very important properties (Source):
- It treats mistakes of different magnitudes differently. If we are uncertain about the values of α and β, there is a larger chance that we might make a big mistake. As a result, the expected loss would also be large.
- Even when we are unsure which variant is larger, we can still stop the test as soon as we are certain that the difference between the variants is small. In this case, if we make a mistake (i.e., we choose β when β < α), we can be confident that the magnitude of that mistake is very small (e.g. β = 10% and α = 10.1%). As a result, we can be confident that our decision will not lead to a large decrease in our metric.

Prior - one of the key differences between the Frequentist and Bayesian approaches is that the latter can take prior information into account. Hence, it doesn't have to learn everything from the new data alone and can, therefore, reach conclusions faster.
For example, let’s say we use a Beta(1, 1) distribution as the prior for a Bernoulli distribution. After observing 40 successes and 60 failures, our posterior distribution is a Beta(41, 61). However, if we had started with a Beta(8, 12) distribution as our prior, we would only need to observe 33 successes and 49 failures to obtain the same posterior distribution. (Source) In general, it is suggested to choose priors that are a bit weaker than what the historical data suggest. Most Bayesian-based A/B testing tools, like VWO, present their results using three key metrics (a rough sketch of how such figures can be computed follows the list):
- Relative improvement VS control - a range by which the observed metric for the variation is better or worse than the same metric for the control. The range is calculated for a 99% probability. The more data the test collects, the smaller this range gets.
- Absolute potential loss - the potential loss is the lift you can lose out on if you deploy A as the winner when B is actually better.
- Chance to beat control/all - probability of the variation being better than the control/all other variations.
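To illustrate where such numbers come from, here is a minimal sketch that estimates the chance to beat control, the expected loss, and a range for the relative improvement by sampling from Beta posteriors. The priors, conversion counts, and loss threshold are made-up example values, not any vendor's exact algorithm.

```python
# Minimal sketch: Bayesian A/B metrics via Monte Carlo sampling from Beta
# posteriors. All counts, priors and thresholds are illustrative.
import numpy as np

rng = np.random.default_rng(42)

# Beta(1, 1) priors updated with (conversions, non-conversions) per variant
post_a = rng.beta(1 + 480, 1 + 9_520, size=200_000)    # control posterior samples
post_b = rng.beta(1 + 530, 1 + 9_470, size=200_000)    # variation posterior samples

chance_to_beat = (post_b > post_a).mean()               # P(variation beats control)
expected_loss = np.maximum(post_a - post_b, 0).mean()   # expected loss if we ship B
lift = (post_b - post_a) / post_a                       # relative improvement samples

print(f"Chance to beat control: {chance_to_beat:.1%}")
print(f"Expected loss if we choose the variation: {expected_loss:.4%}")
print(f"Central 99% range of relative improvement: "
      f"[{np.percentile(lift, 0.5):.1%}, {np.percentile(lift, 99.5):.1%}]")

epsilon = 0.0001  # illustrative loss threshold
if expected_loss < epsilon:
    print("Expected loss is below the threshold - the test can be stopped.")
```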

VWO additionally lets you choose between three decision modes that trade off testing time against the risk of deploying a false winner:

| Mode | Description |
| --- | --- |
| Quick learning - for finding quick trends where tests don’t affect your revenue directly | You can choose this mode when testing non-revenue goals such as bounce rate and time spent on a page, or for quick headline tests. With this mode, you can reduce your testing time for non-critical tests when there isn’t a risk of hurting your revenue directly by deploying a false winner. |
| Balanced - ideal for most tests | This is the default mode and can be used for almost all tests. As the name suggests, it is the best balance between testing time and minimizing the potential loss. |
| High certainty - best for revenue-critical tests when you want to absolutely minimize the potential loss; usually takes the longest to conclude a test | Suppose you have an eCommerce website and you want to test changes to your checkout flow. You want to be as certain as possible to minimize the potential loss from deploying a false winner, even if it takes a lot of time. This is the best mode for such critical tests, which affect your revenue directly. |
Working with Revenue and Other Non-Binomial Metrics
How you (or the machine) calculate the results for an A/B test depends heavily on whether you are testing a binomial or non-binomial metric. Here are some common non-binomial metrics used in A/B testing:
- Average order value
- Average revenue per user
- Average sessions per user
- Average session duration
- Average pages per session
Recommended reading on the subject:
- Testing Differences in Revenue? You’re Probably Not Using the Correct Statistics
- Your Average Revenue Per Customer is Meaningless
BTW, if you have some experience with Python, setting one up for yourself is not too difficult. Another option is to use the following process, suggested by Georgi Georgiev in his blog post (a rough sketch follows the list below):
- Extract user-level data (orders, revenue) or session-level data (session duration, pages per session) or order-level data (revenue, number of items) for the control and the variant
- Calculate the sample standard deviation of each
- Calculate the pooled standard error of the mean
- Use the SEM in any significance calculator / software that supports the specification of SEM in calculations
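As a minimal sketch of that process for a metric like average revenue per user, assuming you have already exported per-user revenue for each group (the arrays below are placeholder data; with real traffic volumes the normal approximation used here is reasonable):

```python
# Minimal sketch: compare average revenue per user using sample standard
# deviations and the pooled standard error of the mean (placeholder data).
import numpy as np
from scipy.stats import norm

rev_control = np.array([0, 0, 49.9, 0, 120.0, 0, 35.5, 0, 0, 80.0])  # per-user revenue
rev_variant = np.array([0, 59.9, 0, 0, 140.0, 0, 0, 99.0, 0, 75.0])

mean_c, mean_v = rev_control.mean(), rev_variant.mean()
sd_c = rev_control.std(ddof=1)     # sample standard deviation, control
sd_v = rev_variant.std(ddof=1)     # sample standard deviation, variant

# Pooled standard error of the difference in means
sem = np.sqrt(sd_c**2 / len(rev_control) + sd_v**2 / len(rev_variant))

z = (mean_v - mean_c) / sem
p_value = 2 * norm.sf(abs(z))      # two-tailed
print(f"ARPU control = {mean_c:.2f}, variant = {mean_v:.2f}, p-value = {p_value:.3f}")
```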
Statistics plays a huge role in A/B testing, and it is an absolute must to know at least the basics of Frequentist statistics, Bayesian statistics, and non-binomial metrics. That way you can choose the right tools and, hopefully, get to know them (and the stats they use) in depth. I hope this post gave you a good starting point and that you learned something new. Did we miss something important? Suggestions are welcome in the comments below - and so are questions.
A truly helpful article 🙂
In our team, we are working on calculating confidence intervals for a non-binomial metric like Average Revenue Per User. Would you happen to know a good calculator for this? Or any literature that throws some light on this?
Hey Soorya,
Thank you for the comment.
This calculator should be a good starting point: https://www.convert.com/calculator/revenue-per-visitor/
To dig deeper, I’d recommend you look at the Mann-Whitney-Wilcoxon rank-sum test in Python or R.
Silver
I feel like a “comprehensive guide” would include at least one formula for sample size
Can you recommend a good calculator for A/B testing statistics?
I recommend this one from CXL https://conversionxl.com/ab-test-calculator/