P-Hacking in Startups

https://news.ycombinator.com/rss Hits: 24
Summary

Speed kills rigor. In startups, the pressure to ship fast pushes teams to report anything that looks like an improvement. That's how p-hacking happens. This piece breaks down three common cases and how to avoid them.

Example 01: Multiple comparisons without correction

Imagine you're a product manager trying to optimize your website's dashboard. Your goal is to increase user signups. Your team designs four different layouts: A, B, C, and D. You run an A/B/n test. Users are randomly assigned to one of the four layouts and you track their activity. Your hypothesis is: layout influences signup behavior. You plan to ship the winner if the p-value for one of the layout choices falls below the conventional threshold of 0.05.

Then you check the results: Option B looks best. p = 0.041. It floats to the top as if inviting action. The team is satisfied and ships it.

But the logic beneath the 0.05 cutoff is more fragile than it appears. That threshold assumes you're testing a single variant, but you tested four. That alone increases the odds of a false positive. Let's look at what that actually means.

Setting a p-value threshold of 0.05 is equivalent to saying: "I'm willing to accept a 5% chance of shipping something that only looked good by chance." So, for a layout with no real effect, the probability that one test doesn't result in a false positive is:

$$1 - 0.05 = 0.95$$

Now, if you run 4 independent tests, the probability that none of them produce a false positive is:

$$0.95 \times 0.95 \times 0.95 \times 0.95 = 0.8145$$

That means the probability that at least one test gives you a false positive is:

$$1 - 0.8145 = 0.1855$$

So instead of working with a 5% false positive rate, you're actually closer to 18.5%: nearly a 1 in 5 risk that you're shipping something based on a fluke. And that risk scales quickly. The more variants you test, the higher the odds that something looks like a win just by coincidence.
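The false-positive arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch, not code from the original piece: the function name `familywise_error_rate` is hypothetical, and it relies on the same assumptions as the text, namely that the tests are independent and that no layout has a real effect.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes the tests are independent and that no variant has a real
    effect, as in the worked example (hypothetical helper, for illustration).
    """
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # P(no false positive in any test) = (1 - alpha) ** num_tests
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Show how the risk grows with the number of variants tested.
    for k in (1, 2, 4, 10, 20):
        print(f"{k:>2} tests -> {familywise_error_rate(k):.4f}")
```

With four tests the function returns roughly 0.1855, matching the figure above; the loop shows how the same calculation behaves as more variants are added.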
Statistically, the...

First seen: 2025-06-21 23:43

Last seen: 2025-06-22 23:00