The rise of AI ‘reasoning’ models is making benchmarking more expensive

https://techcrunch.com/feed/ Hits: 5
Summary

AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. While this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.

According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500. Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis. Some reasoning models are cheaper to benchmark than others; Artificial Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on average, they tend to be pricey.

All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, more than twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400). OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet, Claude 3.7 Sonnet’s non-reasoning predecessor, cost $81.41.

Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models. “At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,” Cameron said. “We are planning for this spend to increase as models are more frequently released.”

Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI benchmarking costs. Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonne...
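A quick back-of-the-envelope calculation in Python (a sketch using the approximate totals and model counts quoted above; variable names are illustrative) shows that the per-model gap is far wider than the headline totals suggest:

    # Approximate figures reported by Artificial Analysis, per the article
    reasoning_total_usd = 5_200       # spent on roughly a dozen reasoning models
    reasoning_model_count = 12
    non_reasoning_total_usd = 2_400   # spent on more than 80 non-reasoning models
    non_reasoning_model_count = 80

    avg_reasoning = reasoning_total_usd / reasoning_model_count              # ~$433 per model
    avg_non_reasoning = non_reasoning_total_usd / non_reasoning_model_count  # ~$30 per model

    print(f"Average cost per reasoning model:     ${avg_reasoning:,.2f}")
    print(f"Average cost per non-reasoning model: ${avg_non_reasoning:,.2f}")
    print(f"Per-model cost ratio: {avg_reasoning / avg_non_reasoning:.1f}x")  # roughly 14x

So while the total outlay is only a bit more than double, on these figures the average cost of benchmarking a reasoning model works out to roughly 14 times that of a non-reasoning model.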

First seen: 2025-04-10 13:44

Last seen: 2025-04-10 17:45