Steps five to seven of the CRO process: Test launch, ongoing analysis and conclusion
Step 5: Test launch
So once QA has done their job and made sure you're putting out quality work, you move to the test launch phase.
Launch on a throttle
So firstly, consider launching a test on a throttle. It's something we would do in probably 99% of cases here. A throttle is an artificial reduction of traffic, limiting the percentage of visitors who are eligible to enter the test. This is basically about risk management. If something does go wrong in live, despite all of your checking (it happens), it limits the exposure of that. And again, thinking back to that reliability element, things will go wrong at some stage in your testing. Being able to demonstrate that you've taken appropriate measures to limit the impact goes a long way towards maintaining faith in what you're doing.
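To make the throttle idea concrete, here's a minimal sketch of how a visitor might be bucketed in or out of a test. This assumes a hash-based bucketing scheme; the function name and approach are illustrative, not the API of any specific testing tool.

```python
import hashlib

def in_test(visitor_id: str, throttle_pct: float) -> bool:
    """Deterministically decide whether a visitor is eligible for the test.

    Hashing the visitor ID keeps the decision stable across page views,
    so the same visitor always gets the same answer.
    """
    digest = hashlib.md5(visitor_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return bucket < throttle_pct

# At a 25% throttle, roughly a quarter of visitors enter the test.
eligible = sum(in_test(f"visitor-{i}", 25) for i in range(10_000))
```

Because the bucketing is deterministic, releasing the throttle later (say from 25% to 100%) only ever adds visitors; nobody who was already in the test gets kicked out.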
Repeat checks again
One of the best ways of working out whether that has happened is to repeat your visual, functional and tracking checks once the test is live. Both consultants and developers here do this. The reason is that we want to police our own quality. We do not want to be in a position where somebody else spots that something has gone wrong. Again, this is about managing it ourselves and showing that there is a reliability and solidity to what we're doing.
Release the throttle
And then lastly, once all the tracking is confirmed, and once all those visual and functional checks are confirmed, we look to release the throttle. If we find that anything is missing, even a single conversion point, we will investigate it before releasing the throttle. The important bit here is that a delay in releasing the throttle is better than releasing it with an element of breakage in the test. Again, this is about policing our own quality.
Test launch timeline
Day zero is the point where a test passes QA from the client and we pick a date and time to launch that test. We would not, under most conditions, launch a test later than 2pm, nor on a Friday. The point here is that if you release something to live, you need to be around to monitor it and make sure it's working as you expect. Doing it too late in the afternoon introduces a risk you don't need. The same goes for a Friday: you do not want to come in to a message on Monday morning saying that an element of the site has not functioned all weekend. It does not make you popular at all! Putting these things in place does not mitigate every single risk. Things will go wrong at some point, but it's about creating the conditions where you don't take on risk you don't need to.
Then on to day one. This is where the test is launched at a 25% throttle. Again, those checks that I talked about, visual, functional and tracking, those are all performed and if they pass, you will then be looking at releasing the throttle on the next day.
So basically we get the test live, we then wait until the next morning to confirm everything. If everything is okay, we'll then recommend that the throttle is released at that point.
That then means that day three for us represents the first full day of unthrottled traffic, therefore the first day of the full analysis period for that test. There will be more to come on exactly what that means in a little while.
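The launch-window rules above (no launches after 2pm, none on a Friday) can be sketched as a small scheduling helper. This is my own illustration: the 10am fallback time is an assumption, not something stated in the process.

```python
from datetime import datetime, timedelta

LATEST_LAUNCH_HOUR = 14  # no launches later than 2pm
FRIDAY = 4               # datetime.weekday(): Monday == 0

def _next_morning(dt: datetime) -> datetime:
    # Assumed fallback: push the launch to 10am the next day.
    return (dt + timedelta(days=1)).replace(hour=10, minute=0,
                                            second=0, microsecond=0)

def next_launch_slot(qa_passed: datetime) -> datetime:
    """Earliest acceptable launch time after QA sign-off:
    never later than 2pm, and never on a Friday or the weekend."""
    slot = qa_passed
    if slot.hour >= LATEST_LAUNCH_HOUR:
        slot = _next_morning(slot)
    while slot.weekday() >= FRIDAY:  # Friday, Saturday or Sunday
        slot = _next_morning(slot)
    return slot

# QA passes on a Thursday at 3pm: too late that day, Friday is out,
# so the test launches on Monday morning.
```

From whatever launch slot this picks, the day-one/day-two/day-three sequence above then follows: throttled launch, checks, throttle release, then the first full day of unthrottled traffic.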
Step 6: Ongoing Analysis
We have a timeline for when we start looking at the results of the test: not immediately after launch, but soon after. The ongoing analysis period starts from the first full day on 100% of the traffic, or at least on whatever proportion of traffic you intend to be the complete audience for the test.
Trends do not appear immediately. When you look at week one, you may start seeing trends emerge. This is the point where things are settling into place and enough users have been through the test, but you may well need a second week.
And generally, if your test has enough traffic, two weeks may show a solidification of those trends. But depending on traffic and the subtlety of the test, you might need more.
The important thing to remember is that whatever length of time you need to get your results, it is worth waiting for those results to stabilise. You want to make sure that you do not recommend a change that is not actually solid and would not hold up in the long run.
Analysis example - week 1 vs. week 2
So, looking at this from a numbers perspective with a concrete example. Here we've got a test that's been launched and we're looking at it at week one. We've got a decent amount of traffic into the test, about 5000 users in each experiment. You've got a metric that you're looking at, which is a click on a CTA. That metric clearly has a positive improvement, +1.20%.
Now, there is a very important number underneath that: the Chance to Beat Control. That is basically a definition of significance for your change. It's essentially the probability that the change will out-perform the control, and this needs to be as close as possible to 100% for you to be confident.
Now, in week one we are only at 68%, which for us is still a bit marginal in terms of significance. Anything under 80% we would consider not very solid. So we will definitely look at this trend again in week two.
And this time, as you can see, not only has the change improved, you got a higher percentage increase, +2.71%, but by this point you've also increased your Chance to Beat Control to 93%, which is really what we want. We want to be in the 90s plus for this.
So this is an illustration of why you have to be careful with the trends you see in week one: they have to be read against the level of significance that comes with them. A second week is essentially going to give you stronger significance, and therefore it's worth waiting for.
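The exact statistics engine behind a "Chance to Beat Control" figure varies by testing tool, but one common Bayesian approach is to sample conversion rates from Beta posteriors for each arm and count how often the variant wins. A rough sketch follows; the conversion counts are invented for illustration and are not the data behind the 68%/93% figures above.

```python
import random

def chance_to_beat_control(control_conversions: int, control_users: int,
                           variant_conversions: int, variant_users: int,
                           draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(variant rate > control rate), using a
    Beta(successes + 1, failures + 1) posterior for each arm."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        c = rng.betavariate(control_conversions + 1,
                            control_users - control_conversions + 1)
        v = rng.betavariate(variant_conversions + 1,
                            variant_users - variant_conversions + 1)
        wins += v > c
    return wins / draws

# Hypothetical week-one numbers: ~5,000 users per arm, a small uplift.
ctb = chance_to_beat_control(1000, 5000, 1012, 5000)
```

Note how a small uplift on 5,000 users per arm lands well short of 90%: that is exactly the week-one situation described above, where the headline improvement looks positive but the significance is not yet there.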
Step 7: Conclusion
So when you look to the conclusion of your test, there are a number of things you want to consider. Firstly, has the test been live for a minimum of 14 full days of unthrottled traffic, and is the time period a multiple of seven days? Next, have you got at least 100 conversions per conversion point per experiment, particularly for whatever primary metric you set in your test plan? And lastly, are you happy with the statistical basis on which you're going to make a decision? I'll explain some of these terms in more detail.
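Those criteria lend themselves to a simple pre-conclusion check. A minimal sketch (function and parameter names are my own, not part of any tool):

```python
def ready_to_conclude(full_days: int,
                      conversions_per_experiment: list[int],
                      min_days: int = 14,
                      min_conversions: int = 100) -> bool:
    """Apply the conclusion criteria: at least 14 full days of
    unthrottled traffic, a period that is a multiple of 7 days, and at
    least 100 conversions on the primary metric in every experiment."""
    return (full_days >= min_days
            and full_days % 7 == 0
            and all(c >= min_conversions
                    for c in conversions_per_experiment))

# e.g. 14 full days with 120 and 135 primary-metric conversions passes;
# 10 days, or any arm under 100 conversions, does not.
```

The statistical question (the third criterion) still needs human judgement, which is why it is phrased as "are you happy with the statistical basis" rather than as a hard threshold.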
Why unthrottled traffic & multiples of 7 days?
For a large part this is about projectability. If you want to give an indication of the value, or future value, of a test, it needs to be representative of a full trading week. You will see differences in the patterns of conversion rate and traffic on different days of the week. So a ten-day test period that includes two weekends versus a ten-day period that only includes one weekend is very unlikely to give you the same picture. If you always look at multiples of seven days, you always take into account the full view of the trading week for the business the test is running on.
And for the unthrottled traffic, again, if you're looking to project forward what this experience might deliver longer term, you need to make sure you're serving it to the maximum possible audience for that test. If you've got one day that's throttled to 20%, another that's throttled to 50%, and then it goes to 100% from there, you only want to start counting from the first day where, for the entire 24-hour period, it was being served to 100% of the eligible visitors.
Why minimum of 14 days?
So for us, this is to counteract any anomalous events that might occur. For example, the one day a month where you send your monthly newsletter could deliver a significant spike in traffic, and it will also deliver a different type of traffic during that time, heavily weighted towards either returning or existing customers, depending on the business. So what we look at is having that minimum 14-day period, which helps us smooth out any of those anomalous elements over the course of that time.
Why do I need at least 100 conversions per experiment?
So this is not a hard and fast rule, and it's not about statistics. The way we liken it is this: if one customer complained about something, you probably wouldn't change your entire business because of it. If 99 others then complained about the exact same thing, you would take it more seriously. So we use this 100 number as effectively a guide towards critical mass. Three visitors doing something differently is of marginal importance. If you've got treble figures of visitors doing things differently because of what you've done, it's a completely different story. If you accept that your CRO programme exists to help inform decision making about your business, that's where the absolute volume becomes a point you really need to pay attention to.
What’s the right statistical level to use?
And then lastly, what's the right statistical level to use? This question has been asked repeatedly over the years and, as far as we are concerned, there isn't one. There is no single statistical method that will tell you categorically that something is better than another, or guarantee that it will be. What you're looking at with statistics is understanding the attitude to risk that you're willing to take.
As Aline mentioned earlier, we're generally looking at scores of 90% plus. If you've got a 90% chance to beat control, for example, it means that if you reran that test, nine times out of ten your experiment would outperform your control. It doesn't mean that it would outperform it to the same level.
So, yeah, this is not about creating certainty. If you're looking for certainty, you won't find it here! What statistics are there to do is help inform the decisions you're making as a business. The alternative is to not look at any statistics at all and just guess, and if you're going to do that, you may as well not run tests at all. They're there to give you an indication of what the best decision for your business might be.