Not Even Bronze: Evaluating LLMs on 2025 International Math Olympiad

https://news.ycombinator.com/rss Hits: 3
Summary

Introduction

Recent progress in the mathematical capabilities of LLMs has created a need for increasingly challenging benchmarks. With MathArena, we address this need by evaluating models on difficult and recent mathematical competitions, offering benchmarks that are both uncontaminated and interpretable. Among these competitions, the International Mathematical Olympiad (IMO) stands out as the most well-known and prestigious. As such, an evaluation of the IMO 2025, which took place just a few days ago, is a necessary addition to the MathArena leaderboard.

In this post, we present our methodology and results from evaluating several state-of-the-art models on the IMO 2025 problems. Our goal was to determine whether these models could reach the key milestones corresponding to medal-level performance: bronze (top 50%), silver (top 25%), or even gold (top 8%). To probe the true limits of current LLMs, we used a best-of-n selection method to scale inference-time compute as far as possible in an attempt to reach one of these milestones (a minimal sketch of this selection loop appears at the end of this summary). The best-performing model is Gemini 2.5 Pro, achieving a score of 31% (13 points), well below the 19/42 necessary for a bronze medal. Other models lagged significantly behind, with Grok-4 and DeepSeek-R1 in particular underperforming relative to their earlier results on other MathArena benchmarks. We also share some initial qualitative observations in this post, but we invite the community to conduct their own analyses. Visit our website to explore the raw model outputs and dive deeper into the results.

Methodology

Setup

We followed a methodology similar to our evaluation of the 2025 USA Math Olympiad [1]. In particular, four experienced human judges, each with IMO-level mathematical expertise, were recruited to evaluate the responses. Evaluation began immediately after the 2025 IMO problems were released to prevent contamination. Judges reviewed the problems and developed grading schemes, with each problem scored out of ...
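As a rough illustration of the best-of-n selection mentioned above, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: `sample_candidate`, the self-assessed quality scores, and `best_of_n` are illustrative assumptions, not MathArena's actual pipeline, in which both the candidate generation and the selection step would be LLM calls.

```python
"""Minimal best-of-n sketch. The sampler and selector are stand-ins:
a real pipeline would query an LLM for candidates and for selection."""
import random


def sample_candidate(problem: str, rng: random.Random) -> tuple[str, float]:
    # Stand-in for one model call: returns a candidate "solution" plus a
    # quality score the selector can rank by (here just a random draw).
    score = rng.random()
    return f"candidate proof for {problem!r} (quality {score:.2f})", score


def best_of_n(problem: str, n: int = 8, seed: int = 0) -> str:
    rng = random.Random(seed)
    # Draw n independent samples, then keep the one the selector ranks
    # highest: extra inference-time compute traded for a better final answer.
    candidates = [sample_candidate(problem, rng) for _ in range(n)]
    best_text, _ = max(candidates, key=lambda c: c[1])
    return best_text


if __name__ == "__main__":
    print(best_of_n("IMO 2025 Problem 1", n=8))
```

The key design point is that the n samples are independent, so the method scales compute without changing the underlying model; its ceiling is whatever the selector can reliably recognize as a correct solution.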

First seen: 2025-07-19 15:29

Last seen: 2025-07-20 00:30