
Choosing the Right Statistical Test for A/B Testing: A Comprehensive Guide

Estimated time to read: 10 minutes

A/B testing is a popular method for comparing two versions of a website, application, or other product to see which performs better. It's a way to test changes to your page against the current design and determine which one produces better results. When conducting A/B tests, it's crucial to understand the statistical tests needed to determine the significance of the results. The right test, in turn, depends on factors such as the type of data (metric), the sample size, and the sampling distribution. An A/B test is successful when you collect enough data to reach a reliable conclusion; whether your customers end up preferring version A or version B is beside the point.

Typical steps for a successful A/B test

Define the Goal

The first step in any A/B test is to identify what you're trying to achieve. This might be increasing click-through rates, improving conversion rates, reducing bounce rates, or anything else that's critical to the success of your website or application. The goal should be Specific, Measurable, Attainable, Relevant, and Time-bound (SMART).

Formulate Hypotheses

Once you've defined your goal, you need to formulate hypotheses about what changes might help you achieve that goal. A hypothesis should be an informed assumption that you make about what you think will improve your metric of interest. For instance, you might hypothesise that changing the colour of a call-to-action button from blue to green will increase click-through rates.

Identify Variables

Identify the variables you're going to change in your A/B test. In the example above, the variable is the colour of the call-to-action button. The version with the original button colour is typically referred to as the "control," and the version with the changed button colour is the "variant."

Design and Implement the Test

Using an A/B testing platform, create the variant(s) by changing the identified variable. The platform will then randomly assign users to either the control or the variant group, ensuring that each user sees a consistent experience (either always the control or always the variant).
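
If you need to reproduce this behaviour yourself (for example, in a server-side test), a common approach is to hash a stable user identifier so the same user always lands in the same group. The sketch below assumes a string user_id and a 50/50 split; the function name and hashing scheme are illustrative, not any particular platform's implementation.

```python
# A minimal sketch of deterministic assignment, assuming users are identified
# by a stable user_id string; the 50/50 split and names are illustrative.
import hashlib

def assign_variant(user_id: str, experiment: str = "cta-colour") -> str:
    """Hash the user id together with the experiment name so each user
    always sees the same version of this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # bucket in 0..99
    return "control" if bucket < 50 else "variant"

print(assign_variant("user-42"))  # the same user always gets the same answer
```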

Run the Test

Let the test run for a pre-determined amount of time or until you've reached a sufficient number of participants for your test to reach statistical significance. This could be days, weeks, or even months, depending on the amount of traffic your site receives and the magnitude of the difference between the variant and the control.

Collect and Analyse Data

As users interact with either the control or variant version of your site, collect data on their behaviour. Use statistical analysis to determine whether the differences you observe are statistically significant, i.e., unlikely to have occurred by chance.

Draw Conclusions

Based on the results of your analysis, draw conclusions about the effect of the variable you changed. If the variant performed significantly better than the control, you might choose to implement the change permanently. If not, you might decide to keep the original version or to test a different variant.

Repeat

A/B testing is an ongoing process. Even after finding a winning variant, continue to come up with new hypotheses to test. This continuous improvement process can help you optimise your site's performance or your product or your communication with your customers over time.

Remember, the key to a successful A/B test is patience and persistence. Don't be discouraged if your first few tests don't produce the results you're hoping for. Keep trying different variables and learning from each test, and over time you'll find the changes that make a difference.

Hypothesis

When conducting a hypothesis test in statistics, such as an A/B test, we typically begin with a null hypothesis (often denoted as H0) and an alternative hypothesis (often denoted as H1 or Ha).

  • Null hypothesis: a statement that there is no effect or no difference in the population. In an A/B test, this might be "There's no difference in click-through rates between the control and variant designs."
  • Alternative hypothesis: the opposite of the null hypothesis. In the A/B test example, it might be "The variant design has a different click-through rate than the control design."

The significance level, often denoted as α (alpha), is a threshold that we choose to decide when we will reject the null hypothesis. Common choices for α are 0.05 (5%) and 0.01 (1%). By setting α = 0.05, for example, we're saying that we're willing to accept a 5% chance of wrongly rejecting the null hypothesis. This is known as a Type I error or a "false positive."

When we perform the hypothesis test, we compute a p-value. The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from your sample data, assuming that the null hypothesis is true.

  • If the p-value is less than or equal to α, we say that the result is statistically significant, and we reject the null hypothesis in favour of the alternative hypothesis. In the context of an A/B test, this might mean concluding that the variant design does indeed have a different click-through rate than the control.
  • If the p-value is greater than α, we say that the result is not statistically significant, and we do not reject the null hypothesis. In the A/B test, this would mean concluding that we don't have enough evidence to say the variant design's click-through rate differs from the control's.

It's important to note that failing to reject the null hypothesis doesn't prove the null hypothesis true. It simply means that we don't have enough evidence to support the alternative hypothesis.
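
To make this decision rule concrete, here is a minimal sketch of a two-proportion z-test for a click-through-rate comparison, using statsmodels; the click and visitor counts are made-up illustrative numbers.

```python
# A minimal sketch of the reject / fail-to-reject decision for a CTR test;
# the counts below are illustrative, not real experiment data.
from statsmodels.stats.proportion import proportions_ztest

clicks = [310, 355]        # clicks in control and variant
visitors = [10000, 10000]  # users exposed to each version
alpha = 0.05               # chosen significance level

stat, p_value = proportions_ztest(count=clicks, nobs=visitors)
print(f"z = {stat:.3f}, p-value = {p_value:.4f}")

if p_value <= alpha:
    print("Reject H0: the click-through rates differ.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")
```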

Lastly, it's worth mentioning that while p-values and the significance level are widely used in hypothesis testing, they are often misinterpreted and can be misleading. They should be considered in the context of other evidence and not used as the sole basis for decision-making. There are additional crucial factors to consider, such as the practical significance of the findings, the power of the test, and the risk of Type II errors. A Type II error occurs when we fail to reject the null hypothesis even though it is false; this is commonly referred to as a "false negative."

Metrics

Metrics in A/B testing are the measurements you use to determine the success or failure of a test. They could range from click-through rates, conversion rates, and average order value to time spent on a site, bounce rates, and more. The kind of metric you choose determines the type of data you will be dealing with, which in turn influences the statistical test you'll use.

Continuous Data These are numerical data that can take any value within a finite or infinite interval. Examples include time spent on a site or average order value. In this case, a t-test or ANOVA can be used depending on the number of variants being tested.

Categorical Data This type of data can be sorted according to category. An example is the click-through rate, where the data is either 'clicked' or 'not clicked'. The chi-square test or Fisher's exact test (for small sample sizes) can be used here.
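
As a quick illustration of matching the test to the metric type, here is a sketch using SciPy: a two-sample t-test for a continuous metric and a chi-square test for a categorical one. The samples and counts are made-up numbers for illustration only.

```python
# A minimal sketch of choosing a test by metric type; data are illustrative.
import numpy as np
from scipy import stats

# Continuous metric (e.g., time on site in seconds): two-sample t-test
control_time = np.array([112.0, 98.5, 130.2, 101.1, 124.7])
variant_time = np.array([119.3, 135.0, 128.8, 142.1, 121.4])
t_stat, p_cont = stats.ttest_ind(control_time, variant_time, equal_var=False)

# Categorical metric (clicked / not clicked): chi-square test on a 2x2 table
#                clicked  not clicked
contingency = [[310, 9690],   # control
               [355, 9645]]   # variant
chi2, p_cat, dof, expected = stats.chi2_contingency(contingency)

print(f"t-test p = {p_cont:.4f}, chi-square p = {p_cat:.4f}")
```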

Sample Sizes

The sample size refers to the number of observations or individuals in any statistical setting. In the context of A/B testing, the sample size is usually the number of users included in the test. The right sample size is essential to attain statistical significance. If it's too small, the test may not have enough power to detect a difference, even if one exists.

Large Sample Sizes When the sample size is large enough, we can use tests like the z-test or chi-square test, which rely on the Central Limit Theorem.

Small Sample Sizes For small sample sizes, we might need to use different tests, such as the t-test for continuous data or Fisher's exact test for categorical data.
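
For the categorical, small-sample case, Fisher's exact test is available in SciPy. The 2x2 counts below are illustrative only.

```python
# A minimal sketch of Fisher's exact test for a small categorical sample;
# the 2x2 counts are illustrative.
from scipy.stats import fisher_exact

#         clicked  not clicked
table = [[4, 21],   # control
         [9, 16]]   # variant

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
```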

Sampling Distributions

The sampling distribution of a statistic refers to the distribution of that statistic when treated as a random variable and computed from a random sample of size n. This distribution can exhibit various shapes, such as skewness, normality, binomial characteristics, and so on.

Normal Distribution If the sampling distribution of the metric follows a normal distribution, we can use parametric tests like the t-test or z-test.

Non-Normal Distribution If the distribution is not normal (e.g., it's highly skewed), non-parametric tests like the Mann-Whitney U test or the Kruskal-Wallis test might be more appropriate.
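
For skewed continuous metrics such as order values with a long tail, the non-parametric tests mentioned above are available in SciPy. The sketch below uses made-up data; Mann-Whitney U compares two groups, while Kruskal-Wallis generalises to three or more.

```python
# A minimal sketch for skewed continuous data; the samples are illustrative.
from scipy.stats import mannwhitneyu, kruskal

control = [12.5, 8.0, 15.2, 9.9, 240.0, 11.3, 7.8]
variant = [14.1, 10.5, 9.7, 310.0, 16.8, 12.2, 13.4]

u_stat, p_mw = mannwhitneyu(control, variant, alternative="two-sided")
print(f"Mann-Whitney U p-value = {p_mw:.4f}")

# Kruskal-Wallis handles three or more groups (e.g., an A/B/C test)
variant_c = [11.0, 13.7, 9.2, 15.5, 12.9, 180.0, 10.1]
h_stat, p_kw = kruskal(control, variant, variant_c)
print(f"Kruskal-Wallis p-value = {p_kw:.4f}")
```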

Now, let's visualise these possibilities in a table:

|                         | Continuous Data                 | Categorical Data    |
|-------------------------|---------------------------------|---------------------|
| Large Sample Size       | t-test / z-test / ANOVA         | Chi-square test     |
| Small Sample Size       | t-test                          | Fisher's exact test |
| Normal Distribution     | t-test / z-test                 | Chi-square test     |
| Non-Normal Distribution | Mann-Whitney U / Kruskal-Wallis | -                   |

*This table is a simplification and only includes some of the most common scenarios. The right test can depend on other factors as well.
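
To make the rules of thumb in the table concrete, here is a hypothetical helper that encodes them. The function name, the cut-off of 30 observations per group for "small", and the returned labels are all assumptions for illustration; a real decision should also weigh the other factors discussed in this guide.

```python
# A hypothetical helper encoding the table's rules of thumb; the name,
# the 30-per-group cut-off, and the labels are assumptions.
def recommend_test(data_type: str, n_per_group: int, normal: bool = True) -> str:
    small = n_per_group < 30
    if data_type == "categorical":
        return "Fisher's exact test" if small else "Chi-square test"
    if data_type == "continuous":
        if not normal:
            return "Mann-Whitney U (2 groups) / Kruskal-Wallis (3+ groups)"
        return "t-test" if small else "z-test (or ANOVA for 3+ groups)"
    raise ValueError("data_type must be 'categorical' or 'continuous'")

print(recommend_test("continuous", n_per_group=20))     # -> t-test
print(recommend_test("categorical", n_per_group=5000))  # -> Chi-square test
```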

Conclusion

Choosing the right statistical test for your A/B testing is crucial to determine the success or failure of your test accurately. Understanding the type of data, sample size, and sampling distribution are key to this process. Remember, the goal is to ensure that the conclusions drawn from the test are reliable and reproducible.

For binary or categorical data, you can also use logistic regression to model the probability of success as a function of one or more explanatory variables. This is particularly useful because categorical outcomes are not normally distributed and therefore call for models and tests designed for them.
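
Here is a minimal sketch of logistic regression on a binary click outcome using statsmodels; the simulated data and the assumed click rates are purely illustrative.

```python
# A minimal sketch of logistic regression for a binary outcome;
# the simulated data and assumed rates are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
variant = rng.integers(0, 2, size=2000)          # 0 = control, 1 = variant
p_click = np.where(variant == 1, 0.036, 0.031)   # assumed true click rates
clicked = rng.binomial(1, p_click)               # observed 1/0 outcomes

X = sm.add_constant(variant.astype(float))       # intercept + group indicator
model = sm.Logit(clicked, X).fit(disp=False)
print(model.summary())
print("Odds ratio for the variant:", np.exp(model.params[1]))
```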

When considering sample size, remember that larger sample sizes give your test more power, increasing the chances of detecting an effect if one truly exists. The trade-off here is the time and resources it takes to gather a large sample. To reach a standard 80% power level for testing, you generally need a large sample size, a large effect size, or a longer duration test.
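
A power analysis turns these trade-offs into a concrete sample size before the test starts. The sketch below uses statsmodels for a proportion metric; the baseline and hoped-for conversion rates are illustrative assumptions.

```python
# A minimal sketch of a sample-size calculation for a proportion metric;
# the baseline (3.1%) and hoped-for (3.6%) rates are assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.036, 0.031)   # hoped-for 3.6% vs baseline 3.1%
analysis = NormalIndPower()
n_per_group = analysis.solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(f"Approximately {n_per_group:.0f} users needed per group")
```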

Regarding sampling distribution, understanding the concept of p-values and confidence intervals is crucial. A p-value is a measure of evidence against the null hypothesis. It answers the question, "How surprising is this result?" If the result is within a "not surprising" area, then we fail to reject the null hypothesis. On the other hand, if the result falls in the "surprising" region, we reject the null hypothesis, which often implies that the change made in the test (B variation) has a significant effect.

Confidence intervals represent the range within which we expect the true population parameter (such as the conversion rate) to fall, given a certain level of confidence (typically 95%). They are a measure of the reliability of an estimate, helping to account for the inherent uncertainty when working with samples instead of the entire population. If the confidence intervals of two variations overlap, it's usually a sign that more testing is needed to reach a definitive conclusion.
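
Confidence intervals for conversion rates are straightforward to compute; the sketch below uses the Wilson interval from statsmodels, with illustrative counts.

```python
# A minimal sketch of 95% confidence intervals for two conversion rates;
# the click and visitor counts are illustrative.
from statsmodels.stats.proportion import proportion_confint

for name, clicks, visitors in [("control", 310, 10000), ("variant", 355, 10000)]:
    low, high = proportion_confint(clicks, visitors, alpha=0.05, method="wilson")
    print(f"{name}: {clicks / visitors:.4f} (95% CI {low:.4f} to {high:.4f})")
```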

Lastly, it's important to consider regression to the mean. This is a phenomenon where if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement. This is why we often see wild fluctuations at the beginning of an A/B test. Over time, as more data is collected, these fluctuations typically decrease.

In conclusion, a table summarising the information above would look like this:

| Data Type          | Sample Size | Sampling Distribution | Recommended Statistical Test    |
|--------------------|-------------|-----------------------|---------------------------------|
| Binary/Categorical | Small       | Any                   | Chi-square, Fisher's exact test |
| Binary/Categorical | Large       | Any                   | Logistic regression             |
| Continuous         | Small       | Normal                | t-test                          |
| Continuous         | Large       | Normal                | z-test                          |
| Continuous         | Any         | Non-normal            | Mann-Whitney U test             |

Remember, this table is a simplification, and the specific circumstances of your A/B test could necessitate different approaches or tests. Always consider consulting with a statistician or data scientist when planning and interpreting your A/B tests.