A/B Tests over Evals

Summary

Evals are an important part of building AI products. We know this first-hand.

Raindrop uses AI to monitor the performance of AI agents. We generate billions of labels a month. We detect issues, generate reports, and automatically cluster intents. We are also constantly changing and optimizing how we detect issues. If we didn’t have evals, it would be impossible to make changes without breaking production.

Internally, we use a custom eval platform inspired by evalite. Many of our customers use Langsmith and Braintrust.

We built Raindrop because evals just weren’t enough. Our customers pay for Raindrop alongside tools like Braintrust because Braintrust can’t tell them what they need to know.

I’m writing this because Ankur, the CEO of Braintrust, recently wrote a blog post directly dismissing A/B tests, and Raindrop specifically (without naming us). In the blog post, Ankur claims that evals are the future. He claims that they help you measure how good your product is, that they are key for rapid experimentation. He also claims that evals will become increasingly important as software becomes more personalized. I believe the opposite to be true for each of these claims.

"The recent acquisitions of Statsig by OpenAI and Eppo by Datadog hint at the turning point: A/B testing is no longer sufficient for AI product optimization. The future is evals."

Side Note: For the sake of brevity, I’m going to avoid critiquing some of the stranger, more mind-bending parts of his blog post, like the above quote… which is like saying that Google’s acquisition of Windsurf is proof that coding agents are on the way out.

But first, what is an eval anyway?

Right now, it feels like everyone is reaching for a new word and calling it progress. “Offline Evals”, “Online Evals”, “LLM judges”, “Scorers”. Fancy labels, familiar ideas. When we blur definitions, we blur decisions.
If you strip away the jargon, you have the two levers engineers have always used to understand change: testing changes before shippin...

First seen: 2025-11-18 13:50

Last seen: 2025-11-18 16:50