What ten years of A/B testing taught me about trusting data

Statistical significance is table stakes. The hard part is knowing when your test is lying to you.

Why do good experimentation programs still make bad decisions?

Because significance is a floor, not a ceiling. A p-value of 0.03 means the result cleared the conventional 0.05 threshold. It does not mean the intervention will hold up in production, generalize to a different season, or survive the next site-wide launch. Most of the bad calls I have watched teams make were technically significant.

The failure mode is almost always the same: a team ships a winner, moves on, and never checks whether the lift actually accrued to the top line. Six months later, revenue is roughly where it would have been without the program, and no one can explain why.

What are the three rules that actually change program outcomes?

Run tests to their pre-committed sample size, even when the number looks great at day four. Early stopping is the single largest source of overstated wins I have seen. Sequential testing tools exist for a reason; if you are not using one, honor the sample size.
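
To make "pre-committed sample size" concrete, here is a minimal sketch of the standard two-proportion power calculation, using only the Python standard library. The function name, the 3% baseline rate, and the 10% minimum detectable lift are illustrative assumptions, not figures from any real program; your own tool may use a different formula.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in EACH arm for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate (e.g. 0.03 for 3%)
    relative_lift: smallest relative lift worth detecting (e.g. 0.10 for +10%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 3% baseline conversion, +10% minimum detectable lift.
required = sample_size_per_arm(baseline_rate=0.03, relative_lift=0.10)
print(f"Commit to at least {required:,} visitors per arm before reading the result.")
```

With these assumed inputs the requirement comes out at roughly 53,000 visitors per arm, which is why a result that "looks great at day four" on a modest-traffic page is rarely readable yet.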

Freeze the environment. Do not launch a homepage refresh in the middle of a checkout test. If you must, extend the test window and log the confound in the write-up. A significant result measured against a moving target is not a winner.

Report absolute revenue, not just relative lift. A 12% lift on a page that gets 800 sessions a week is a rounding error: at typical conversion rates it adds only a few orders a week. Teams that only report percentage lifts can inflate their program's apparent impact by an order of magnitude, and eventually a CFO notices.
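
The arithmetic behind that rule is short enough to sketch. The 800 sessions and 12% lift come from the example above; the 2% baseline conversion rate, the $100 average order value, and the high-traffic comparison page are assumptions added purely for illustration.

```python
def weekly_incremental_revenue(sessions_per_week, baseline_cvr, relative_lift, avg_order_value):
    """Translate a relative conversion lift into extra revenue per week."""
    baseline_orders = sessions_per_week * baseline_cvr
    extra_orders = baseline_orders * relative_lift
    return extra_orders * avg_order_value


# Low-traffic page: big relative lift, small absolute effect (assumed CVR and AOV).
small_page = weekly_incremental_revenue(800, 0.02, 0.12, 100.0)

# Hypothetical high-traffic page: small relative lift, large absolute effect.
big_page = weekly_incremental_revenue(200_000, 0.02, 0.03, 100.0)

print(f"12% lift on 800 sessions/week:     ${small_page:,.2f} per week")
print(f" 3% lift on 200,000 sessions/week: ${big_page:,.2f} per week")
```

Under these assumptions the 12% winner is worth about $192 a week, while the unglamorous 3% lift is worth about $12,000. Reporting both columns side by side is what keeps the percentage from doing the talking.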

When should you kill a test that looks like it's winning?

When the win depends on a segment that is not your target segment. I have watched teams celebrate a mobile checkout redesign that lifted conversion on iPad, which represented 3% of traffic and 1% of revenue, while trending down on iPhone. Segment-cut your results before you call anything.
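
A segment cut can be as simple as computing the same lift per segment and comparing it with the blended number. The sketch below uses made-up counts shaped like the iPad and iPhone story above (iPad at 3% of traffic); the class and function names are hypothetical, and the counts are far too small to be significant, so read it as a direction check only.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ArmStats:
    visitors: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.visitors


# segment name -> (control stats, variant stats)
Results = Dict[str, Tuple[ArmStats, ArmStats]]


def relative_lift(control: ArmStats, variant: ArmStats) -> float:
    return variant.rate / control.rate - 1.0


def segment_lifts(results: Results) -> Dict[str, float]:
    """Relative lift of variant over control, computed separately per segment."""
    return {seg: relative_lift(c, v) for seg, (c, v) in results.items()}


def overall_lift(results: Results) -> float:
    """Relative lift on the pooled (blended) traffic across all segments."""
    control = ArmStats(
        sum(c.visitors for c, _ in results.values()),
        sum(c.conversions for c, _ in results.values()),
    )
    variant = ArmStats(
        sum(v.visitors for _, v in results.values()),
        sum(v.conversions for _, v in results.values()),
    )
    return relative_lift(control, variant)


# Illustrative counts only, not real data.
results = {
    "iphone": (ArmStats(9700, 291), ArmStats(9700, 289)),
    "ipad": (ArmStats(300, 9), ArmStats(300, 15)),
}

print(f"overall: {overall_lift(results):+.1%}")
for segment, lift in segment_lifts(results).items():
    print(f"{segment}: {lift:+.1%}")
```

Here the blended result is positive while the target segment, iPhone, is slightly negative. The whole "win" lives in the segment that matters least.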

Also when the win is driven by a bug. If a new variant has a broken analytics event on one path, for example a page-view event that fails to fire, its conversion rate looks higher because part of its denominator is missing. Instrument every variant identically. Test the instrumentation before you test the change.
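
One way to test the instrumentation is to compare per-user event rates across variants and flag any event that fires far less often in one arm. This is a rough sketch with hypothetical event names, counts, and an arbitrary 80% threshold. It is only meaningful for events upstream of the change, or during an A/A run, because a real behavioral change can legitimately move downstream counts.

```python
from typing import Dict, List, Tuple


def flag_instrumentation_gaps(
    events_per_variant: Dict[str, Dict[str, int]],
    assigned_users: Dict[str, int],
    control: str = "control",
    min_ratio: float = 0.8,
) -> List[Tuple[str, str]]:
    """Return (variant, event) pairs whose per-user event rate is below
    min_ratio times the control's rate for the same event."""
    control_rates = {
        event: count / assigned_users[control]
        for event, count in events_per_variant[control].items()
    }
    flagged = []
    for variant, counts in events_per_variant.items():
        if variant == control:
            continue
        for event, control_rate in control_rates.items():
            rate = counts.get(event, 0) / assigned_users[variant]
            if control_rate > 0 and rate < min_ratio * control_rate:
                flagged.append((variant, event))
    return flagged


# Illustrative counts: variant_b logs about half the page views of control.
events = {
    "control": {"page_view": 10000, "add_to_cart": 1200, "purchase": 300},
    "variant_b": {"page_view": 5100, "add_to_cart": 1190, "purchase": 298},
}
users = {"control": 10000, "variant_b": 10000}

print(flag_instrumentation_gaps(events, users))
```

In this made-up case, purchases divided by page views would show the variant converting at nearly twice the control's rate, when the only thing that changed is that a tracking call stopped firing.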

What have I stopped believing?

I have stopped believing that CRO lifts compound. In most B2B contexts they do not. The next test finds the ceiling of the last one, or the seasonal cycle absorbs the lift, or the audience mix shifts. I now model program value as a portfolio of independent bets, not as a stacking sequence. That change alone made my forecasts hold up.

I have also stopped believing that every test needs to win. Null results are program hygiene. If your win rate is above 40%, you are almost certainly stopping early, mis-analyzing, or only running easy tests. A 20–30% win rate on well-designed tests is a healthy program.

What I would do differently

Earlier in my career I would launch a test the day the design was ready. Now I write the analysis plan first: what the primary metric is, what the secondary metric is, what the guardrail is, what will make me kill the test, and what will make me extend it. The tests I plan this way ship less often and settle more questions per ship. It is the highest-leverage change I have made.
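
An analysis plan does not need special tooling. As a sketch, it can be a small immutable record that refuses to count as launch-ready until every question above has an answer. The field names, metric names, and criteria below are hypothetical placeholders, and the sample size field is borrowed from the first rule rather than from the list in this section.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: the plan cannot be edited after the test starts
class AnalysisPlan:
    primary_metric: str
    secondary_metrics: Tuple[str, ...]
    guardrail_metrics: Tuple[str, ...]
    sample_size_per_arm: int
    kill_criteria: Tuple[str, ...]
    extend_criteria: Tuple[str, ...]

    def missing_fields(self) -> List[str]:
        """Names of fields left empty (empty string, empty tuple, or zero)."""
        return [name for name, value in vars(self).items() if not value]

    def ready_to_launch(self) -> bool:
        return not self.missing_fields()


# Hypothetical plan for a checkout test; every value is a placeholder.
plan = AnalysisPlan(
    primary_metric="checkout_conversion_rate",
    secondary_metrics=("average_order_value",),
    guardrail_metrics=("refund_rate",),
    sample_size_per_arm=53000,
    kill_criteria=("guardrail degrades beyond the agreed limit",),
    extend_criteria=("site-wide launch lands during the test window",),
)

print("ready to launch:", plan.ready_to_launch())
```

Freezing the record is the point of the design: the kill and extend rules are written before the data exists, so they cannot quietly bend to fit it.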
