A/B Testing Quiz: Can You Dodge These 12 Experiment Traps?

12 scenes from real product teams — a peeking PM, a suspicious 50.2/49.8 split, a VP who "just knows." Can you call the trap in each one? Plain-English explanations after every question — no sign-up.

Your test shows the new checkout beats the old one at p = 0.03. The PM announces: "There's a 97% chance the new checkout is better!" What does p = 0.03 actually mean?

The dashboard shows a conversion lift with a 95% confidence interval of [+1%, +5%]. Your analyst says: "There's a 95% probability the true lift is between 1% and 5%." Is she right?

A growth PM checks the experiment dashboard every morning, planning to stop "as soon as p dips below 0.05." On day 3 it does, and he declares victory. What's wrong?

You designed a 50/50 split. A week in, the actual split is 50.2% vs 49.8% — barely off, but the chi-square check flags it at p < 0.0005. The metrics look great, though. What now?

A redesigned homepage banner gets +30% clicks in its first two days. The team wants to stop the test early and celebrate. What's the risk?

Overall, version B converts worse than A. But split by segment, B beats A among BOTH new and returning users. How is that possible?

Your new recommendation widget lifts click-through by 5% — but page load time is significantly up, and so are uninstalls. Ship it?

A colleague argues: "Users who turned on dark mode retain better — that proves dark mode causes retention. Why bother with an A/B test?"

A two-day test on a low-traffic page comes back p = 0.20. A teammate writes in the doc: "Experiment proved the feature has no effect." What's the correct takeaway?

Your VP says: "I've shipped products for 20 years. Skip the test — I know this feature will win." What does large-scale evidence say?

The team is choosing a north-star metric to judge experiments by. Which candidate makes the best OEC (Overall Evaluation Criterion)?

With 50 million users in the test, revenue per user rises 0.01% at p = 0.0001. The analyst calls it "a highly significant win — ship immediately." What's the catch?

The traps in this quiz (cheat sheet)

p-value: How surprising your data would be IF there were no real difference: P(data at least this extreme | H0), never "the probability the feature works."
Confidence interval: 95% is the method's long-run hit rate, not a 95% chance the truth sits inside this one interval.
Peeking / early stopping: Watching mid-test and stopping at the first p < 0.05 inflates false positives to 20–40%+.
SRM: Actual split deviates from the designed split (flagged at p < 0.0005) — a data-quality "fever"; results are void until diagnosed.
Novelty effect: New features get curiosity clicks that decay; run full weeks and watch the trend before believing early lifts.
Simpson's paradox: The aggregate trend reverses inside every subgroup, thanks to a confounder like uneven user mix.
OEC (Overall Evaluation Criterion): The single decision metric: measurable in the short term, causally tied to long-term value, hard to game.
Guardrail metrics: "Don't make this worse" metrics — latency, errors, uninstalls — that veto a launch even when the OEC wins.
Statistical power: Your chance of detecting a real effect; aim for ≥0.80 and compute the sample size before the test.
Randomization: The one clean route to "B caused it": it balances everything else, known and unknown, in expectation.
HiPPO: Highest-paid person's opinion — what experiments exist to overrule; only ~1/3 of good ideas actually win.
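
To make the "compute the sample size before the test" advice concrete, here is a minimal Python sketch of the standard normal-approximation formula for comparing two conversion rates. The function name, the 5% baseline, and the 5.5% target are illustrative assumptions, not figures from the quiz, and a production tool may use a more refined method.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, target, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect baseline -> target.

    Uses the usual two-sided, two-proportion normal approximation.
    All argument values are supplied by the caller; nothing here is
    specific to any experimentation platform.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.80
    variance = baseline * (1 - baseline) + target * (1 - target)
    effect = target - baseline
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical scenario: 5.0% baseline conversion, hoping to detect 5.5%.
print(sample_size_per_arm(0.05, 0.055))
# Halving the detectable lift needs far more traffic.
print(sample_size_per_arm(0.05, 0.0525))
```

Because the required sample grows with the inverse square of the effect, halving the lift you want to detect roughly quadruples the traffic you need.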

What is A/B testing?

An A/B test is an online randomized controlled experiment: users are randomly split between the current version (A) and a change (B). Randomization makes the two groups equal in expectation on everything else, so any difference in outcomes can be attributed to the change itself — the one thing observational analytics can never cleanly give you. That's why controlled experiments are called the gold standard for causation.

The traps in this quiz aren't hypothetical — they come from two decades of large-scale experimentation research, most of it collected in Ron Kohavi, Diane Tang and Ya Xu's "Trustworthy Online Controlled Experiments" (2020) and a string of KDD papers: peeking and early stopping (Johari et al., 2017), sample ratio mismatch (Fabijan et al., 2019), novelty effects (Kohavi et al., 2012), and the sobering finding that only about one third of well-designed changes actually improve key metrics (Kohavi et al., 2013).

The statistics trip up professionals, too. The American Statistical Association issued a formal statement in 2016 because p-values are so routinely misread, and Hoekstra et al. (2014) found that even active researchers endorse false statements about confidence intervals. If you missed those questions, you're in distinguished company.

Take the quiz cold, then read the one-line explanations and the cheat sheet. If a scenario feels uncomfortably familiar — a stopped-early test, an unexplained split imbalance — that's the quiz doing its job.

FAQ

What is a good p-value for an A/B test?

Convention sets the significance threshold at 0.05. But p < 0.05 doesn't mean the effect is real or important — it means data this extreme would be rare if there were no effect. Check the effect size and confidence interval too, and only judge p after running to your planned sample size.
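
As a rough illustration of reading the effect size and interval alongside p, here is a minimal Python sketch of a two-sided two-proportion z-test. The function name and the conversion counts are made up for this example, and the normal approximation is an assumption that only suits large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_summary(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (lift, p_value, (ci_low, ci_high)) for rate B minus rate A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    lift = rate_b - rate_a

    # Test statistic: pooled standard error, valid under "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = lift / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Interval: unpooled standard error around the observed lift.
    se_diff = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return lift, p_value, (lift - z_crit * se_diff, lift + z_crit * se_diff)

# Hypothetical counts: 1,000 of 20,000 convert on A, 1,100 of 20,000 on B.
lift, p_value, (ci_low, ci_high) = two_proportion_summary(1000, 20000, 1100, 20000)
print(f"lift={lift:.4f}  p={p_value:.4f}  95% CI=[{ci_low:.4f}, {ci_high:.4f}]")
```

With these hypothetical counts p lands below 0.05, yet the interval runs from barely above zero to nearly one percentage point, so the p-value alone says little about how large the lift is.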

How long should an A/B test run?

At least one to two full weeks — whole weekday-plus-weekend cycles. Effects often take that long to stabilize as novelty wears off, and stopping at the first p < 0.05 inflates false positives badly.
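
The cost of stopping at the first p < 0.05 can be seen in a small simulation. The sketch below is a simplified model, not a replay of any real experiment: it assumes an A/A test (no true effect), equal traffic each day, and a z-statistic built from standard normal daily increments. The function name and defaults are hypothetical.

```python
import math
import random

def simulate_false_positive_rates(n_sims=4000, n_days=14, z_crit=1.96, seed=0):
    """Compare 'stop at the first significant day' with 'test once at the end'.

    Returns (peeking_rate, fixed_horizon_rate) under a true null effect.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        ever_crossed = False
        z = 0.0
        for day in range(1, n_days + 1):
            # Each day adds one standardized increment with no real effect.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(day)  # z-statistic after `day` days
            if abs(z) > z_crit:
                ever_crossed = True          # a daily peeker would stop here
        peeking_hits += ever_crossed
        fixed_hits += abs(z) > z_crit        # only the final day counts
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"daily peeking: {peeking_rate:.1%}  single final test: {fixed_rate:.1%}")
```

Under these assumptions the single final test stays near the nominal 5% false positive rate, while checking daily and stopping at the first crossing pushes it well above that.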

What is SRM in A/B testing?

Sample Ratio Mismatch: the actual traffic split deviates from the designed split more than chance allows (commonly flagged at chi-square p < 0.0005). It signals a data-quality problem — the experiment's results can't be trusted until the root cause is found.
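
A minimal sketch of such a check in Python, using only the standard library: a chi-square goodness-of-fit test with one degree of freedom, whose p-value can be written with the complementary error function. The function name and the user counts are illustrative assumptions.

```python
import math

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-variant traffic split."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((n_a - expected_a) ** 2 / expected_a
            + (n_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_THRESHOLD = 0.0005  # the flagging threshold quoted above

# The same 50.2% / 49.8% split at two hypothetical traffic levels.
for n_a, n_b in [(50_200, 49_800), (502_000, 498_000)]:
    p = srm_p_value(n_a, n_b)
    print(n_a, n_b, f"p={p:.2g}", "SRM flagged" if p < SRM_THRESHOLD else "ok")
```

The same 50.2/49.8 split is unremarkable at a hundred thousand users but gets flagged at a million, which is why the check looks at the p-value rather than at how far off the percentages appear.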

Why do A/B tests prove causation when analytics can't?

Random assignment makes the two groups equal in expectation on every variable, known and unknown. The treatment becomes the only systematic difference, so outcome differences can be attributed to it. Observational data always leaves confounders on the table.
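
A toy simulation can show the difference. In the Python sketch below, every number is invented for illustration: a feature has zero true effect on retention, engaged users retain better, and in the observational world engaged users are also more likely to switch the feature on. The function name and probabilities are hypothetical.

```python
import random

def retention_gap(n_users, randomized, seed=0):
    """Retention of users with the feature minus retention of users without.

    The feature itself never changes retention; only engagement does.
    """
    rng = random.Random(seed)
    retained = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n_users):
        engaged = rng.random() < 0.5
        if randomized:
            has_feature = rng.random() < 0.5  # coin flip, ignores engagement
        else:
            # Self-selection: engaged users opt in far more often.
            has_feature = rng.random() < (0.7 if engaged else 0.2)
        # Retention depends on engagement alone (assumed true effect of zero).
        stays = rng.random() < (0.6 if engaged else 0.3)
        counts[has_feature] += 1
        retained[has_feature] += stays
    return retained[True] / counts[True] - retained[False] / counts[False]

print(f"observational gap: {retention_gap(50_000, randomized=False):+.3f}")
print(f"randomized gap:    {retention_gap(50_000, randomized=True):+.3f}")
```

In this setup the observational comparison shows a sizable retention gap that comes entirely from who opts in, while the randomized comparison hovers around zero, the feature's true effect here.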

Is this A/B testing quiz free?

Yes — 12 questions, instant results and explanations, no sign-up.