Web Bench: a new way to compare AI browser agents

https://news.ycombinator.com/rss Hits: 3
Summary

TL;DR: Web Bench is a new dataset for evaluating web browsing agents. It consists of 5,750 tasks on 452 different websites, with 2,454 tasks being open sourced. Anthropic Sonnet 3.7 CUA is the current SOTA, with the detailed results here.

Over the past few months, web browsing agents such as Skyvern, Browser-use and OpenAI's Operator (CUA) have taken the world by storm. These agents have been used in production for a variety of tasks, from helping people apply to jobs and downloading invoices to doing SS-4 filings for newly incorporated companies.

[Image: Skyvern attempting to purchase a product]
[Image: Skyvern attempting to fill out the IRS form]

Most agents report state-of-the-art performance, but we find that browser agents still struggle with a wide variety of tasks, particularly ones involving authentication, form filling and file downloading. This is because the standard benchmark today (WebVoyager) focuses on read-heavy tasks and consists of only 643 tasks across only 15 websites (out of 1.1 billion possible websites!). While a great starting point, the benchmark does not capture the internet’s adversarial nature towards browser automation or the difficulty of tasks that mutate data on a website.

[Image: Can’t access chase.com]
[Image: Can’t close a popup dialog]

As a result, we partnered with Halluminate and created a new benchmark to better quantify these failures.
Our goal was to create a new, consistent measurement system for AI web agents, building on the foundations laid by WebVoyager by:

- Expanding the number of websites from 15 → 452, and tasks from 643 → 5,750, to test agent performance on a wider variety of websites
- Introducing the concept of READ vs. WRITE tasks
  - READ tasks involve navigating websites and fetching data
  - WRITE tasks involve entering data, downloading files, logging in, solving 2FA, etc., and were not well represented in the WebVoyager dataset
- Measuring the impact of browser infrastructure (e.g., accessing the websites, solving captchas, not crashing)

We’re excited to announce Web ...
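To make the READ vs. WRITE split concrete, here is a minimal Python sketch of how per-category success rates could be tallied. It is not Web Bench's actual schema or scoring code: the `TaskKind`, `TaskResult`, and `success_rate_by_kind` names, the field names, and the sample outcomes are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class TaskKind(Enum):
    READ = "READ"    # navigate a website and fetch data
    WRITE = "WRITE"  # enter data, download files, log in, solve 2FA, etc.


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record layout, not the benchmark's real format.
    website: str
    kind: TaskKind
    succeeded: bool


def success_rate_by_kind(results):
    """Return {TaskKind: fraction of tasks of that kind that succeeded}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        totals[r.kind] += 1
        passed[r.kind] += r.succeeded  # True counts as 1, False as 0
    return {kind: passed[kind] / totals[kind] for kind in totals}


# Made-up outcomes for illustration only; these are not benchmark results.
sample = [
    TaskResult("example.com", TaskKind.READ, True),
    TaskResult("example.com", TaskKind.READ, True),
    TaskResult("example.org", TaskKind.WRITE, True),
    TaskResult("example.org", TaskKind.WRITE, False),
]
rates = success_rate_by_kind(sample)
print(rates[TaskKind.READ], rates[TaskKind.WRITE])
```

Reporting the two categories separately keeps a strong READ score from masking failures on WRITE tasks, which is the gap the post says existing benchmarks leave open.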

First seen: 2025-05-29 21:07

Last seen: 2025-05-29 23:08