If your A/B test has failed to reach statistical significance, you may be acting on assumptions that are untrue or misleading. Be sure to verify your results before you put your Sitecore DMS A/B findings into action.
We have looked at a number of possible points of caution to take note of when conducting A/B tests, including the three mistakes to avoid and taking personas into account. But, as many marketers tend to forget, an A/B test is only as effective as its sample size and statistical significance.
Have you reached statistical significance today?
It can be tempting to run a test for a day or a week and, based on a noticeable difference in conversion, pick a winner. However, in doing this the chance that your decision will actually stand up over time is quite low. Ultimately you need to know when your test has collected enough data to be a reliable predictor of user behaviour or, in math geek parlance, when the test is significant.
There are volumes of academic texts filled with guidelines for determining the appropriate sample size for statistical significance. While we don’t need to wade into those deep waters, there are two key decisions you do need to make. The first is what confidence level you want to have in your test. Analytics guru Avinash Kaushik recommends that you shoot for a 95% confidence level, and who are we to disagree?
The second thing to keep in mind: the smaller the change in conversion you are expecting to see, the larger the sample size you will need. This means that your first tests of the “obvious” improvements require a relatively small test size, while later micro-improvements will need a greater level of testing.
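To make that relationship concrete, here is a minimal sketch of the standard two-proportion sample-size estimate. The function name, the example baseline rate of 3%, and the 80% power figure are our own illustrative assumptions, not values from the Sitecore example; the formula itself is the usual normal-approximation one.

```python
from statistics import NormalDist

def required_sample_size(baseline, lift, confidence=0.95, power=0.80):
    """Approximate visitors needed PER VARIANT to detect an absolute
    conversion lift, using the normal approximation for two proportions.

    baseline -- current conversion rate (e.g. 0.03 for 3%)
    lift     -- smallest absolute improvement you want to detect
    """
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tail
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / lift ** 2)
    return int(n) + 1  # round up to whole visitors

# An "obvious" improvement needs far fewer visitors than a micro-improvement:
print(required_sample_size(0.03, 0.02))   # detect 3% -> 5%: ~1,500 per variant
print(required_sample_size(0.03, 0.002))  # detect 3% -> 3.2%: over 100,000
```

Note how shrinking the detectable lift by a factor of ten inflates the required sample size by roughly a factor of a hundred, since the lift appears squared in the denominator.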
Statistical significance in Sitecore
The method for determining statistical significance varies depending on what software or program you are using to conduct your A/B tests, and as we are big fans of Sitecore, we will use their Digital Marketing System for our example. If you have worked with Sitecore in the past, it’s possible that you have encountered issues with this particular analysis, but never fear, we are here to help.
In this screenshot from the DMS all you can see are the components and the engagement value that have been delivered. If we know the values assigned – in this case we assume a value of 1 – then we know that Option B had 60 conversions and C had 20. If we assume that options B and C were each shown 2,000 times (4,000 page deliveries in total), the conversion rates are:

- Option B: 60 / 2,000 = 3.0%
- Option C: 20 / 2,000 = 1.0%
Using this data we can perform a one- or two-tail test on the experiment, depending on the question you are asking: if you are looking to show that one option outperforms the other (a directional hypothesis), use a one-tail test; if you are simply looking to confirm that B and C differ, use the two-tail test.
- The pooled proportion p = (60 + 20) / (2,000 + 2,000) = 2.00%
- The standard error SE = SQRT(p*(1-p)*(1/n_1+1/n_2)) ≈ 0.0044; in Excel, =SQRT(p*(1-p)*(1/n_1+1/n_2))
- The test statistic t = (p_B − p_C) / SE = (0.03 − 0.01) / SE ≈ 4.5175
The t-value measures how many standard errors separate the two conversion rates. Our value of 4.5175 far exceeds the 95% critical values (roughly 1.645 for a one-tail test and 1.96 for a two-tail test), so the experiment is highly significant under both one- and two-tail tests.
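The calculation above can be reproduced in a few lines of Python, which is handy for checking DMS figures without building a spreadsheet. The only assumption carried over from the example is that B and C were each delivered 2,000 times.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the DMS example: each option delivered 2,000 times,
# B converted 60 visitors (3.0%), C converted 20 (1.0%).
n_b, n_c = 2000, 2000
conv_b, conv_c = 60, 20
p_b, p_c = conv_b / n_b, conv_c / n_c

p = (conv_b + conv_c) / (n_b + n_c)           # pooled proportion = 2.00%
se = sqrt(p * (1 - p) * (1 / n_b + 1 / n_c))  # standard error ~= 0.0044
t = (p_b - p_c) / se                          # test statistic ~= 4.5175

# Convert to a two-tail p-value and compare against the 5% threshold:
p_two_tail = 2 * (1 - NormalDist().cdf(t))
print(round(t, 4), p_two_tail < 0.05)  # 4.5175 True
```

Because the t-value is computed from the unrounded standard error, it matches the 4.5175 quoted above; rounding SE to 0.0044 first would give a slightly different figure.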
Statistical significance is important; so is context
A word of caution: while statistical significance is important, if used improperly it can give you and your stakeholders false confidence in your results. For example, if you know there are seasonal variations in your user behaviours, keep in mind that a statistically valid test of users in the summer months may not apply to users in the winter months. Understand what sample of your population a test is reaching.
As we have shown, there are many subtle errors that can creep in when conducting A/B tests. If we have piqued your interest in undertaking some A/B testing of your own, but you are a little stumped by all of the details, get in touch and we can help you figure out what you need to know to start seeing serious A/B testing results.