Leading voices in experimentation suggest that you test everything. Some inconvenient truths about A/B testing suggest it’s better not to.
Image created by OpenAI’s DALL-E
Those of you who work in online and product marketing have probably heard about A/B testing and online experimentation in general. Countless A/B testing platforms have emerged in recent years and they urge you to register with them and leverage the power of experimentation to get your product to new heights. Tons of industry leaders and smaller-calibre influencers alike write at length about successful implementation of A/B testing and how it was a game-changer for a certain business. Do I believe in the power of experimentation? Yes, I do. But at the same time, after upping my statistics game and getting through tons of trials and errors, I’ve discovered that, like with anything in life and business, certain things get swept under the rug sometimes, and usually those are inconvenient shortcomings of experiments that undermine their status as a magical unicorn.
To better understand the root of the problem, I have to start with a little bit of history on how online A/B testing came to life. Back in the day, online A/B testing wasn’t a thing, but a few companies known for their innovation decided to transfer experimentation to the online realm. Of course, by that time A/B testing had already been a well-established method of finding out the truth in science for many years. Those companies were Google (2000), Amazon (2002) and other big names like Booking.com (2004), with Microsoft joining soon after. It doesn’t take a lot of guesses to see what those companies have in common: the two things that matter most to any business, money and resources. Resources are not only infrastructure, but also people with expertise and know-how. And they already had millions of users on top of that. Incidentally, proper implementation of A/B testing requires all of the above.
To this day, they remain the most recognized industry voices in online experimentation, along with those that emerged later: Netflix, Spotify, Airbnb, and some others. Their ideas and approaches are widely recognized and discussed, as are their innovations in online experiments. The things they do are considered best practices, and it’s impossible to fit all of them into one tiny article, but a few get mentioned more than others, and they basically come down to:
- test everything
- never release a change without testing it first
- even the smallest change can have a huge impact
Those are great rules indeed, but not for every company. In fact, for many product and online marketing managers, blindly trying to follow those rules may result in confusion and even disaster. Why is that? Firstly, blindly following anything is a bad idea, but sometimes we have to rely on expert opinion for lack of our own expertise and understanding of a certain field. What we usually forget is that not all expert opinions translate well to our own business realm. The fundamental flaw of those basic principles of successful A/B testing is that they come from multi-billion-dollar corporations, and you, the reader, are probably not affiliated with one of them.
This article pivots heavily around the well-known concept of statistical power and its extension, the sensitivity of an experiment. This concept is the foundation of the decision-making I rely on daily in my experimentation work.
The Resources
“The illusion of knowledge is worse than the absence of knowledge” (Someone smart)
If you know absolutely nothing about A/B testing, the idea may seem quite simple — just take two versions of something and compare them against each other. The one that shows a higher number of conversions (revenue per user, clicks, registrations, etc) is deemed better.
If you are a bit more sophisticated, you know something about statistical power and calculation of the required sample size for running an A/B test with the given power for detecting the required effect size. If you understand the caveats of early stopping and peeking — you are well on your way.
The misconception that A/B testing is easy gets quickly shattered when you run a bunch of A/A tests, in which two identical versions are compared against each other, and show the results to the person who needs to be educated on A/B testing. If you have a big enough number of those tests (say 20–40), they will see that in some tests the treatment (also known as the alternative variant) shows an improvement over the control (the original version), and in others the treatment appears to be worse. When constantly monitoring the running experiments, we may see significant results approximately 20% of the time. But how is that possible if we compare two identical versions to each other? In fact, I ran this exercise with the stakeholders of my company and showed them these misleading results, to which one of the stakeholders replied that it was undoubtedly a “bug” and that we wouldn’t have seen anything like it if everything had been set up properly.
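The effect of constant monitoring can be reproduced with a small simulation. Below is a minimal sketch, not the setup used with the stakeholders: it assumes a normally distributed metric, 20 evenly spaced interim looks and a z-test with known variance, and every parameter value (number of tests, users, mean, standard deviation) is made up for illustration. The exact share of “significant” A/A tests depends on how often you peek.

```python
import random
from itertools import accumulate
from math import sqrt
from statistics import NormalDist


def simulate_aa_tests(n_tests=400, n_users=1000, n_peeks=20,
                      mean=10.0, std=40.0, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Evenly spaced interim looks; the last one is the planned final analysis.
    checkpoints = [n_users * (i + 1) // n_peeks for i in range(n_peeks)]
    any_peek_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        # Both groups are drawn from the SAME distribution: there is no true effect.
        cum_control = list(accumulate(rng.gauss(mean, std) for _ in range(n_users)))
        cum_treatment = list(accumulate(rng.gauss(mean, std) for _ in range(n_users)))
        significant = []
        for n in checkpoints:
            diff = (cum_treatment[n - 1] - cum_control[n - 1]) / n
            # z-test with known variance: a simplification that is valid here
            # only because we generated the data ourselves.
            z = diff / (std * sqrt(2 / n))
            significant.append(abs(z) > z_crit)
        any_peek_hits += any(significant)   # "significant" at ANY look
        final_hits += significant[-1]       # "significant" at the final look only
    return any_peek_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = simulate_aa_tests()
print(f"significant at any peek:        {peeking_rate:.1%}")
print(f"significant at final look only: {final_rate:.1%}")
```

The final-look rate should stay close to the chosen alpha, while the any-peek rate is noticeably higher, which is exactly the kind of result that gets mistaken for a “bug”.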
It’s only the tip of a huge iceberg, and if you already have some experience, you know that:
- experimentation is far from easy
- testing different things and different metrics requires different approaches that go far beyond the ordinary, conventional A/B testing that most A/B testing platforms offer. As soon as you go beyond simple conversion rate testing, things get much more difficult. You start concerning yourself with variance and its reduction, estimating novelty and primacy effects, assessing the normality of the distribution, etc. In fact, you won’t be able to test certain things properly even if you know how to approach the problem (more on that later).
- you may need a qualified data scientist/statistician. In fact, you WILL definitely need more than one of them to figure out what approach to use in your particular case and what caveats to take into account. This includes figuring out what to test and how to test it.
- you will also need a proper data infrastructure for collecting analytics and running A/B tests. The JavaScript library of your A/B testing platform of choice, the simplest solution, is not the best one, since it comes with known issues of flickering and increased page load time.
- if you don’t fully understand the context and cut corners here and there, it’s easy to get misleading results.
Below is a simplified flowchart that illustrates the decision-making process involved in setting up and analyzing experiments. In reality, things get even more complicated since we have to deal with different assumptions like homogeneity, independence of observations, normality etc. If you’ve been around for a while, those are words you are familiar with, and you know how hard taking everything into account may get. If you are new to experimentation, they won’t mean anything to you, but hopefully they’ll give you a hint that maybe things are not as easy as they seem.
Image by Scribbr, with permission
Small and medium-sized companies may struggle to allocate the resources required for setting up a proper A/B testing environment, and launching each new A/B test may be a time-consuming task. But that is only one part of the problem. By the end of this article you’ll hopefully understand why, given all of the above, when a manager drops me a message saying “We need to test this”, I often reply “Can we?”. Really, why can’t we?
The Users and the Sensitivity
The majority of successful experiments at companies like Microsoft and Airbnb had an uplift of less than 3%
Those of you who are familiar with the concept of statistical power know that the more randomization units we have in each group (for the sake of simplicity, let’s refer to them as “users”), the higher the chance of detecting a difference between the variants, all else being equal. That’s another crucial difference between huge companies like Google and your average online business: yours may not have nearly enough users and traffic to detect small differences of up to 3%. Even detecting something like a 5% uplift with adequate statistical power (the industry standard is 0.80) may be a challenge.
Detectable Uplift for different sample sizes at alpha 0.05, power 0.80, base mean of 10 and std. 40, equal variance. (Image by the author)
In the sensitivity analysis above, we can see that detecting an uplift of roughly 7% is relatively easy, with only 50,000 users per variant required, but if we want to detect 3%, the number of users required is roughly 275,000 per variant.
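The numbers in the chart can be approximated in a few lines. This is a minimal sketch, assuming a two-sided test on the difference between two independent means with equal group sizes and equal variance, and using the normal approximation rather than the exact t-distribution. The function name `detectable_uplift` is my own; the parameters are those from the chart caption (alpha 0.05, power 0.80, base mean 10, std 40).

```python
from math import sqrt
from statistics import NormalDist


def detectable_uplift(n_per_variant, base_mean, std, alpha=0.05, power=0.80):
    """Smallest relative uplift detectable with the given sample size.

    Normal approximation for a two-sided, two-sample test with equal
    group sizes and equal variance.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Absolute difference in means that the test can detect.
    absolute_mde = (z_alpha + z_power) * std * sqrt(2 / n_per_variant)
    # Express it relative to the baseline mean, i.e. as an uplift.
    return absolute_mde / base_mean


for n in (50_000, 275_000):
    uplift = detectable_uplift(n, base_mean=10, std=40)
    print(f"{n:>7} users per variant -> detectable uplift of about {uplift:.1%}")
```

With these inputs the sketch lands close to the 7% and 3% figures quoted above; small deviations from a dedicated power analysis tool are expected because of the approximation.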
Friendly tip: G*Power is a very handy piece of software for power analysis and power calculations of any kind, including sensitivity analysis for the difference between two independent means. And although it shows the effect size in terms of Cohen’s d, the conversion to uplift is straightforward.
A screenshot of the test sensitivity calculation performed in G*Power. (Image by the author)
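For reference, here is how the conversion between Cohen’s d and uplift can be done. It is a sketch under the assumption that both groups share the same standard deviation, so that d is simply the difference in means divided by that standard deviation; the function names are hypothetical.

```python
def cohens_d_to_uplift(d, base_mean, std):
    """Convert Cohen's d into a relative uplift over the baseline mean.

    d = (treatment_mean - base_mean) / std, so the absolute difference
    is d * std and the uplift is that difference divided by base_mean.
    """
    return d * std / base_mean


def uplift_to_cohens_d(uplift, base_mean, std):
    """Inverse conversion: relative uplift back to Cohen's d."""
    return uplift * base_mean / std


# With a base mean of 10 and a standard deviation of 40:
print(f"{cohens_d_to_uplift(0.0175, base_mean=10, std=40):.1%}")
print(f"{uplift_to_cohens_d(0.03, base_mean=10, std=40):.4f}")
```

Note how tiny the effect sizes are in terms of d: with a noisy metric, even a 7% uplift corresponds to a d well below the 0.2 that is conventionally called “small”.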
With that knowledge there are two routes we can take:
1. We can come up with an acceptable duration for the experiment, calculate the MDE (minimum detectable effect) and launch the experiment. If we don’t detect a difference, we scrap the change and assume that if a difference exists, it’s not higher than the MDE at the power of 0.99 and the given significance level (0.05).
2. We can decide on the duration and calculate the MDE, and if the MDE is too high for the given duration, we simply decide to either not launch the experiment or release the change without testing it (the second option is how I do things).
In fact, the first approach was mentioned by Ronny Kohavi on LinkedIn:
The downside of the first approach, especially if you are a startup or small business with limited resources, is that you keep funneling resources into something that has very little chance to give you actionable data.
Running experiments that are not sensitive enough may lead to fatigue and demotivation among members of the team involved in experimentation
So, if you decide to chase that holy grail and test everything that gets pushed to production, what you’ll end up with is:
- designers spend days, sometimes weeks, designing an improved version of a certain landing page or section of the product
- developers implement the change through your A/B testing infrastructure, which also takes time
- data analysts and data engineers set up additional data tracking (additional metrics and segments required for the experiment)
- the QA team tests the end result (if you are lucky, everything is fine and doesn’t need to be reworked)
- the test is pushed to production, where it stays active for a month or two
- you and the stakeholders fail to detect a significant difference (unless you run your experiment for a ridiculous amount of time, thus endangering its validity)
After a bunch of tests like that, everybody, including the top growth voice of the company, loses motivation and gets demoralized by spending so much time and effort on setting up tests just to end up with “there is no difference between the variants”. But here’s where the wording plays a crucial part. Check this:
1. there is no significant difference between the variants
2. we have failed to detect the difference between the variants. It may still exist and we would have detected it with high probability (0.99) if it were 30% or higher or with a somewhat lower probability (0.80) if it were 20% or higher.
The second wording is a little bit more complicated but is more informative. 0.99 and 0.80 are different levels of statistical power.
- It better aligns with the known experimentation statement of “absence of evidence is not evidence of absence”.
- It sheds light on how sensitive our experiment was to begin with and may expose the problem companies often encounter: a limited amount of traffic for conducting well-powered experiments.
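The 20% and 30% figures in the second wording are tied together: for a fixed sample size and significance level, the MDE at one power level is a constant multiple of the MDE at another. A minimal sketch, assuming a two-sided test and the normal approximation (the function name `mde_ratio` is hypothetical):

```python
from statistics import NormalDist


def mde_ratio(power_high=0.99, power_low=0.80, alpha=0.05):
    """How much larger the MDE is at power_high than at power_low.

    Under the normal approximation the MDE is proportional to
    (z_alpha + z_power), so sample size and variance cancel out.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    return (z_alpha + z(power_high)) / (z_alpha + z(power_low))


mde_at_80 = 0.20                      # MDE at power 0.80, as in the example wording
mde_at_99 = mde_at_80 * mde_ratio()   # the same experiment, evaluated at power 0.99
print(f"MDE at power 0.80: {mde_at_80:.0%}, at power 0.99: {mde_at_99:.0%}")
```

So once you know the MDE at the conventional power of 0.80, you can report the “high probability” threshold by multiplying it by roughly 1.5.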
Coupled with Ronny Kohavi’s claim in one of his white papers that the majority of experiments at the companies he worked with had an uplift of less than 3%, it makes us scratch our heads. In fact, in one of his publications he recommends keeping the MDE at 5%.
- I’ve seen tens of thousands of experiments at Microsoft, Airbnb, and Amazon, and it is extremely rare to see any lift over 10% to a key metric. [source]
- My recommended default as the MDE to plug-in for most e-commerce sites is 5%. [source]
- At Bing, monthly improvements in revenue from multiple experiments were usually in the low single digits. [source, section 4]
I still believe that smaller companies with an underoptimized product that are only starting out with A/B testing may see higher uplifts, but I don’t feel it will be anywhere near 30% most of the time.
The Problem
When working on your A/B testing strategy, you have to look at the bigger picture: available resources, the amount of traffic you get and how much time you have on your hands.
So, what we end up having, and by “we” I mean a considerable number of businesses that are only starting their experimentation journey, is tons of resources spent on designing and developing the test variant and on setting up the test itself (including metrics, segments, etc.), all combined with a very slim chance of actually detecting anything in a reasonable amount of time. And I should probably reiterate that one shouldn’t put too much faith in the idea that the true effect of an average test is going to be a whopping 30% uplift.
I’ve been through this. We had many failed attempts to launch experimentation at SendPulse, and it always felt futile until, not that long ago, I realized that I should think outside A/B tests and look at the bigger picture, which is this:
- you have finite resources
- you have finite traffic and users
- you won’t always have the right conditions for running a properly powered experiment; in fact, if you are a smaller business, those conditions will be even rarer
- you should plan experiments in the context of your own company, allocate resources carefully and be reasonable by not wasting them on a futile task
- not running an experiment on the next change is fine, although not ideal; businesses succeeded long before online experimentation was a thing. Some of your changes will have a negative impact and some a positive one, but it’s OK as long as the positive impact outweighs the negative one.
- if you’re not careful and are too zealous about experimentation being the only true way, you may channel most of your resources into a futile task, putting your company in a disadvantageous position
Below is a diagram known as the “Hierarchy of Evidence”. Personal opinions are at the base of the pyramid, but they still count for something, and it’s better to embrace the truth that sometimes they are the only reasonable option given the circumstances, however flawed. Of course, randomized experiments are much higher up in the pyramid.
Hierarchy of Evidence in Science. (Image by CFCF, via Wikimedia Commons, licensed under CC BY-SA 4.0).
The Solution
In a more traditional setting, the flow for launching an A/B test goes something like this:
1. someone comes up with an idea for a certain change
2. you estimate the resources required for implementing the change
3. those involved make the change come true (designers, developers, product managers)
4. you set the MDE (minimum detectable effect) and the other parameters (alpha, beta, type of test: two-tailed or one-tailed)
5. you calculate the required sample size and find out how long the test has to run given the parameters
6. you launch the test
As covered above, this approach is the core of “experiment-first” design: the experiment comes first at whatever cost, and the required resources will be allocated. The time it takes to complete an experiment isn’t an issue either. But how would you feel if you discovered that it takes two weeks and 3 people to implement the change, and the experiment has to run 8–12 months to be sensitive enough? And remember, stakeholders do not always understand the concept of the sensitivity of an A/B test, so justifying running it for a year may be a challenge, and the world is changing too rapidly for this to be acceptable. Not to mention the technical issues that compromise test validity, stale cookies being one of them.
When resources, users and time are limited, we may reverse the flow and make it a “resource-first” design, which may be a reasonable solution in your circumstances.
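To get a feel for how long an “experiment-first” test may have to run, the sample size step of the traditional flow can be sketched as follows. This assumes a conversion-rate metric and the standard normal-approximation formula for a two-sided two-proportion test; the function names, the 5% base conversion rate and the daily traffic figure are all made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(base_rate, relative_mde, alpha=0.05, power=0.80):
    """Users per variant needed to detect a relative uplift in a conversion rate."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


def required_days(n_per_variant, daily_users_per_variant):
    """Full days needed to collect the sample at the given daily traffic."""
    return ceil(n_per_variant / daily_users_per_variant)


# Hypothetical inputs: 5% base conversion, 5% MDE, 667 new users per variant per day.
n_needed = required_sample_size(base_rate=0.05, relative_mde=0.05)
days = required_days(n_needed, daily_users_per_variant=667)
print(f"{n_needed} users per variant, about {days} days of traffic")
```

Plugging in your own baseline and traffic quickly shows whether the run time is measured in weeks or in many months.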
Assume that:
- an A/B test based on a pseudo-user-id (based on cookies that go stale and get deleted sometimes) is more stable with shorter running times, so let’s make it 45 days tops
- an A/B test based on a stable identifier like user-id may afford extended running times (3 months for conversion metrics and 5 months for revenue-based metrics, for instance)
What we do next is:
- see how many units we can gather for each variant in 45 days; let’s say it’s 30,000 visitors per variant
- calculate the sensitivity of the A/B test given the available sample size, alpha, power and base conversion rate
- if the detectable effect is reasonable (anything from 1% to 10% uplift), consider allocating the required resources for implementing the change and setting up the test
- if the detectable effect is higher than 10%, and especially if it’s higher than 20%, allocating the resources may be unwise, since the true uplift from your change is likely to be lower and you won’t be able to reliably detect it anyway
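The first two steps above can be sketched like this. It is a simplified calculation, assuming a conversion-rate metric, a two-sided test and the normal approximation with the baseline variance used for both groups; the 5% base conversion rate and the daily traffic are hypothetical values chosen to give roughly 30,000 visitors per variant in 45 days.

```python
from math import sqrt
from statistics import NormalDist


def conversion_mde(n_per_variant, base_rate, alpha=0.05, power=0.80):
    """Relative uplift in conversion rate detectable with the given sample size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two proportions,
    # approximated with the baseline variance in both groups.
    se = sqrt(2 * base_rate * (1 - base_rate) / n_per_variant)
    return (z_alpha + z_power) * se / base_rate


daily_visitors_per_variant = 667   # hypothetical traffic
max_days = 45                      # running-time cap for cookie-based tests
n = daily_visitors_per_variant * max_days
mde = conversion_mde(n, base_rate=0.05)
print(f"{n} visitors per variant -> detectable uplift of about {mde:.1%}")
```

In this made-up case the detectable uplift sits right around the 10% mark, so whether to commit resources would be a judgment call rather than an obvious yes.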
I should note that the maximum experiment length and the effect threshold are up to you to decide, but I found that these worked just fine for us:
- the maximum length of an A/B test on the website: 45 days
- the maximum length of an A/B test based on conversion metrics in the product with persistent identifiers (like user_id): 60 days
- the maximum length of an A/B test based on revenue metrics in the product: 120 days
Sensitivity thresholds for the go-no-go decision:
- up to 5%: perfect. The launch is totally justified, and we may allocate more resources to this one.
- 5–10%: good. We may launch it, but we should be careful about how many resources we channel into this one.
- 10–15%: acceptable. We may launch it if we don’t have to spend too many resources: limited developer time, limited designer time, not much in terms of setting up additional metrics and segments for the test.
- 15–20%: barely acceptable, but if few resources are needed and there is a strong belief in success, the launch may be justified. Even so, you may want to inform the team of the poor sensitivity of the test.
- >20%: unacceptable. Launching tests with sensitivity that low is only justified in rare cases. Consider what you may change in the design of the experiment to improve the sensitivity (maybe the change can be implemented on several landing pages instead of one, etc.).
Experiment categorization based on sensitivity (Image by the author)
Note that in my business setting we allow revenue-based experiments to run longer because:
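If you want to bake these thresholds into a planning script, a lookup as simple as the one below will do. The function name is hypothetical, and which category a value exactly on a boundary (say, 10%) falls into is my assumption, since the list above doesn’t specify it.

```python
def sensitivity_category(mde):
    """Map a detectable uplift (as a fraction, e.g. 0.07 for 7%) to a go/no-go category."""
    if mde <= 0.05:
        return "perfect"
    if mde <= 0.10:
        return "good"
    if mde <= 0.15:
        return "acceptable"
    if mde <= 0.20:
        return "barely acceptable"
    return "unacceptable"


for mde in (0.03, 0.07, 0.12, 0.18, 0.30):
    print(f"MDE {mde:.0%}: {sensitivity_category(mde)}")
```

The thresholds themselves are the ones that worked for us; treat them as defaults to tune, not as universal constants.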
- an increase in revenue is the highest priority
- revenue-based metrics have higher variance and hence lower sensitivity compared to conversion-based metrics, all things being equal
After some time we have developed an understanding as to what kind of tests are sensitive enough:
- changes across the entire website or a group of pages (as opposed to a single page)
- changes “above the fold” (changes to the first screen of a landing page)
- changes to the onboarding flow in the service (since it’s only the start of the user journey in the service, the number of users is maxed out here)
- we mostly experiment only on new users, omitting the old ones (so as not to deal with estimating possible primacy and novelty effects)
The Source of Change
I should also introduce the term “the source of change” to expand on my idea and methodology further. At SendPulse, like at any other company, things get pushed to production all the time, including those that deal with the user interface, usability and other cosmetics. They’d been released long before we introduced experimentation because, you know, a business can’t stand still. At the same time, there are changes that we specifically would like to test and that we wouldn’t release otherwise, for example when someone comes up with an interesting but risky idea.
- In the first case, resources are allocated no matter what and there’s a strong belief that the change has to be implemented. It means the resources we spend to test it are only those for setting up the test itself, not for developing or designing the change. Let’s call it a “natural change”.
- In the second case, the resources committed to the test include designing and developing the change and setting up the experiment. Let’s name it an “experimental change”.
Why this categorization? Remember, the philosophy I’m describing is testing what makes sense to be tested from the sensitivity and resources point of view, without causing much disruption in how things have been done in the company. We do not want to make everything dependent on experimentation until the time comes when the business is ready for that. Considering everything we’ve covered so far, it makes sense to gradually slide experimentation into the life of the team and company.
The categorization above allows us to use the following approach when working with “natural changes”:
- if we are considering testing a “natural change”, we look only at how many resources we need to set up the test, and even if the sensitivity (MDE) is over 20% but the resources needed are minimal, we give the test a go
- if we don’t see a drop in the metric, we stick to the new variant and roll it out to all users (remember, we planned to release it anyway before we decided to test it)
- so, even if the test wasn’t sensitive enough to detect the change, we have set ourselves up with a sort of “guardrail” on the off chance that the change really dropped the metric by quite a lot. We don’t try to block rolling out the change by seeking definitive evidence that it’s better; it’s just a precautionary measure.
On the other hand, when working with “experimental changes”, the protocol may differ:
- we need to base our decision on the sensitivity, which plays a crucial role here: since we look at how many resources we need to allocate to implement both the change and the test itself, we should only commit to the work if we have a good shot at detecting the effect
- if we don’t see an uplift in the metric, we gravitate towards discarding the change and keeping the original, so resources may be wasted on something we will scrap later, and they should be carefully managed
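The two protocols can be summarized as a small decision function. This is a deliberately simplified sketch: the function name, the string labels and the use of a single p-value with a 0.05 threshold are my assumptions, and where the text above says we “gravitate towards” discarding, the code makes a hard call.

```python
def rollout_decision(source_of_change, p_value, observed_uplift, alpha=0.05):
    """Decide what to do with a change once the test has finished.

    source_of_change: "natural" (would be released anyway) or
                      "experimental" (built specifically to be tested).
    observed_uplift:  relative difference of treatment vs. control,
                      negative when the metric dropped.
    """
    significant = p_value < alpha
    if source_of_change == "natural":
        # The test is only a guardrail: block the release on a significant drop.
        if significant and observed_uplift < 0:
            return "reject"
        return "roll out"
    if source_of_change == "experimental":
        # The change has to earn its place with a significant uplift.
        if significant and observed_uplift > 0:
            return "roll out"
        return "discard"
    raise ValueError(f"unknown source of change: {source_of_change!r}")


print(rollout_decision("natural", p_value=0.40, observed_uplift=-0.01))
print(rollout_decision("experimental", p_value=0.40, observed_uplift=0.02))
```

The same inconclusive result leads to opposite outcomes depending on where the change came from, which is the whole point of the categorization.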
The Results (Hopefully Positive)
How exactly does this strategy help a growing business adapt to an experimentation mindset? I feel that the reader has figured it out by now, but it never hurts to recap.
- you give your team time to adapt to experimentation by gradually introducing A/B testing
- you don’t spend limited resources on experiments that won’t have enough sensitivity, and resources ARE AN ISSUE for a growing startup: you may need them somewhere else
- as a result, you don’t provoke the rejection of A/B testing by nagging your team with experiments that are never statistically significant despite tons of time spent on launching them. When a high proportion of your tests show something significant, the realization sinks in that it hasn’t been in vain.
- by testing “natural changes”, things that the team thinks should be rolled out even without an experiment, and only rejecting them when they show a statistically significant drop, you don’t cause too much disruption. But if the test does show a drop, you sow a seed of doubt, showing that not all of our decisions are great.
The important thing to remember — A/B tests aren’t something trivial, they require tremendous effort and resources to do them right. Like with anything in this world, we should know our limits and what we are capable of at this particular time. Just because we want to climb Mount Everest doesn’t mean we should do it without understanding our limits — there are lots of corpses of startups on the figurative Mount Everest who went way beyond what they were capable of.
Good luck in your experimenting!
Not A/B Testing Everything is Fine was originally published in Towards Data Science on Medium.