New study shows why simulated reasoning AI models don’t yet live up to their billing

https://arstechnica.com/feed/

Summary

A screenshot of the 2025 USAMO Problem #1 and a solution, shown on the AoPSOnline website. Credit: AoPSOnline

The US Math Olympiad (USAMO) serves as a qualifier for the International Math Olympiad and presents a much higher bar than tests like the American Invitational Mathematics Examination (AIME). While AIME problems are difficult, they require integer answers. USAMO demands contestants write out complete mathematical proofs, scored for correctness, completeness, and clarity over nine hours and two days.

The researchers evaluated several AI reasoning models on the six problems from the 2025 USAMO shortly after their release, minimizing any chance the problems were part of the models' training data. These models included Qwen's QwQ-32B, DeepSeek R1, Google's Gemini 2.0 Flash Thinking (Experimental) and Gemini 2.5 Pro, OpenAI's o1-pro and o3-mini-high, Anthropic's Claude 3.7 Sonnet with Extended Thinking, and xAI's Grok 3.

An April 25, 2025, screenshot of the researchers' MathArena website showing accuracy scores for SR models on each problem in the USAMO. Credit: MathArena

While one model, Google's Gemini 2.5 Pro, achieved a higher average score of 10.1 out of 42 points (~24 percent), the results otherwise showed a massive performance drop compared to AIME-level benchmarks. The other evaluated models lagged considerably further behind: DeepSeek R1 and Grok 3 averaged 2.0 points each, Google's Flash-Thinking scored 1.8, Anthropic's Claude 3.7 managed 1.5, while Qwen's QwQ and OpenAI's o1-pro both averaged 1.2 points. OpenAI's o3-mini had the lowest average score at just 0.9 points (~2.1 percent). Out of nearly 200 generated solutions across all tested models and runs, not a single one received a perfect score for any problem.

While OpenAI's newly released o3 and o4-mini-high were not examined for this study, benchmarks at the researchers' MathArena website show o3-high scoring 21.73 percent overall and o4-mini-high scoring 19.05 percent overall on USAMO.
However, t...

First seen: 2025-04-25 21:57

Last seen: 2025-04-27 19:17

