Foundation: A/B testing is a randomized controlled experiment, and randomization is what licenses the causal claim

An A/B test is fundamentally an online controlled experiment (OCE), the online counterpart of a randomized controlled trial (RCT): you randomly assign users to a control group (A, the status quo) and a treatment group (B, the new feature), then compare the outcome metrics of the two groups (Kohavi/Tang/Xu).

The key word is *random*. Randomization makes the two groups equal in expectation across all known and unknown variables, so after assignment the only systematic difference between them is whether they received the treatment (the new feature). The difference in outcomes can therefore be attributed cleanly to the treatment — something observational data cannot do.
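One common way to get this kind of assignment in practice is to hash the user ID, which behaves like a coin flip but gives the same user the same group on every visit. The sketch below is an illustration under assumptions, not a method taken from the source: the function name `assign_variant`, the experiment name `new-checkout`, and the choice of SHA-256 are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment).

    Hypothetical helper: hashing stands in for a coin flip so that the
    same user always lands in the same group for a given experiment.
    """
    # Salt with the experiment name so different experiments split users independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits (32 bits) into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


# Assign 10,000 made-up users and count the size of each group.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "new-checkout")] += 1
print(counts)  # the two groups should come out roughly equal in size
```

Because the assignment depends only on the hash and not on anything the user does, no user trait can steer who ends up in B, which is the property the paragraph above relies on.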

Remember it in one line: correlation ≠ causation. The randomized controlled experiment is the gold standard for establishing causation: other approaches exist (for example, causal inference on observational data), but they lean on assumptions that cannot be fully verified, so their conclusions are less trustworthy.

🔆Think of random assignment as thoroughly shuffling the deck before dealing: once it's well shuffled, the two hands have no systematic difference in 'luck.' Now you do exactly one thing to one of them (swap in the new feature), and any final gap in who wins can only be blamed on the thing you did — not on 'this hand was better to begin with.' Without shuffling (no randomization), you can never tell whether it was the feature working or whether that group of users was simply more active all along.
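The shuffled-deck point can be checked with a small simulation. The sketch below invents a user population where the feature's true effect is fixed at 1.0 (knowable only because the data is simulated), then estimates that effect twice: once with coin-flip assignment, and once letting the more active users opt in by themselves. Every number and name here (`TRUE_EFFECT`, the activity distribution, the opt-in rule) is an assumption made for illustration.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.0  # assumed ground truth; only knowable because this is a simulation


def difference_in_means(groups: dict) -> float:
    """Treated mean minus untreated mean."""
    return mean(groups[True]) - mean(groups[False])


def simulate(n_users: int = 20_000, seed: int = 0):
    rng = random.Random(seed)
    randomized = {True: [], False: []}
    self_selected = {True: [], False: []}
    for _ in range(n_users):
        activity = rng.gauss(10.0, 3.0)  # pre-existing activity level of the user
        noise = rng.gauss(0.0, 1.0)

        # Shuffled deck: a coin flip decides who gets the feature.
        treated = rng.random() < 0.5
        randomized[treated].append(activity + TRUE_EFFECT * treated + noise)

        # No shuffle: already-active users are the ones who adopt the feature.
        opted_in = activity > 10.0
        self_selected[opted_in].append(activity + TRUE_EFFECT * opted_in + noise)

    return difference_in_means(randomized), difference_in_means(self_selected)


randomized_est, self_selected_est = simulate()
print(f"randomized estimate:    {randomized_est:.2f}")
print(f"self-selected estimate: {self_selected_est:.2f}")
```

The randomized estimate should land close to the true effect of 1.0, while the self-selected one comes out several times larger, because it mixes the feature's effect with the fact that the adopters were more active to begin with.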

Random assignment alone isn't enough; before the experiment you also need to settle two things:

OEC (Overall Evaluation Criterion, closely related to a North Star metric): a single quantitative decision metric that combines multiple goals (revenue, engagement, retention) to judge whether the experiment should ship. A good OEC must: (1) be measurable within the experiment window yet be a good proxy for long-term value; (2) align with organizational strategy; (3) not be gameable in unintended ways. Beware Goodhart's law: when a measure becomes a target, it ceases to be a good measure, because people start optimizing the number rather than the value it was meant to reflect.
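The source does not say how the goals are combined into one number. One simple option, shown here purely as an assumed sketch, is a weighted sum of each component's relative lift. The weights in `OEC_WEIGHTS` and the metric values are made up for the example.

```python
# Hypothetical weights: how much each goal counts toward the single OEC number.
OEC_WEIGHTS = {"revenue": 0.5, "engagement": 0.3, "retention": 0.2}


def relative_lift(control: float, treatment: float) -> float:
    """Relative change of treatment versus control, e.g. 0.05 means +5%."""
    return (treatment - control) / control


def oec_lift(control: dict, treatment: dict, weights: dict = OEC_WEIGHTS) -> float:
    """Weighted sum of per-metric relative lifts (one possible OEC definition)."""
    return sum(
        weight * relative_lift(control[name], treatment[name])
        for name, weight in weights.items()
    )


# Made-up per-user averages for the two groups.
control = {"revenue": 2.00, "engagement": 5.0, "retention": 0.40}
treatment = {"revenue": 2.10, "engagement": 5.0, "retention": 0.38}

# Revenue is up 5%, engagement is flat, retention is down 5%.
print(f"OEC lift: {oec_lift(control, treatment):+.3f}")
```

With these weights the revenue gain outweighs the retention loss, so the combined number is slightly positive. A team that values retention more would pick different weights and could reach the opposite decision, which is why the weights need to be agreed on before the experiment starts.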

Guardrail metrics: organization-level bottom lines you do not want to degrade, such as page-load latency, crash / error rate, and uninstall rate. A single experiment may compute hundreds or thousands of metrics; only a handful feed the OEC being optimized, while the guardrails are only required 'not to get worse.' Even if the OEC goes up, a statistically significant degradation in latency or uninstall rate means the change should be rejected. Guardrails are the brake that pulls 'single-metric optimization' back toward 'overall health.'
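The 'OEC up but guardrail hit means reject' rule can be written as a small decision function. This is a sketch under assumptions: `MetricResult`, `decide`, and the 0.05 significance threshold are hypothetical, every guardrail is assumed to be a 'higher is worse' metric (as latency, error rate, and uninstall rate are), and the p-values are taken as already computed by whatever test the experiment platform uses.

```python
from dataclasses import dataclass

ALPHA = 0.05  # assumed significance threshold


@dataclass
class MetricResult:
    name: str
    relative_change: float  # treatment vs control, e.g. 0.03 means +3%
    p_value: float          # from a significance test run elsewhere


def decide(oec: MetricResult, guardrails: list, alpha: float = ALPHA) -> str:
    # A guardrail is breached when it got significantly worse.
    # Assumption: for every guardrail, an increase is the bad direction.
    breached = [
        g.name for g in guardrails
        if g.p_value < alpha and g.relative_change > 0
    ]
    if breached:
        return "reject: guardrail degraded (" + ", ".join(breached) + ")"
    if oec.p_value < alpha and oec.relative_change > 0:
        return "ship"
    return "do not ship: no significant OEC improvement"


# Made-up results: the OEC improves, but page-load latency gets significantly worse.
oec = MetricResult("oec", relative_change=0.02, p_value=0.01)
guardrails = [
    MetricResult("latency", relative_change=0.08, p_value=0.001),
    MetricResult("uninstall_rate", relative_change=0.01, p_value=0.40),
]
print(decide(oec, guardrails))
```

The guardrail check runs first on purpose: a breached guardrail blocks the launch no matter how good the OEC looks.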

📷 Image placeholder: on the left, a large funnel merging multiple metrics ('revenue / engagement / retention') into a single outlet labeled OEC; on the right, several upright guardrails side by side labeled 'latency / error rate / uninstall rate — only allowed not to worsen,' with a caption 'OEC up but hits a guardrail = rejected'