The Hidden Drivers of HRM's Performance on ARC-AGI

We scored on hidden tasks, ran ablations, and found that performance comes from an unexpected source

On June 8, 2025, the Hierarchical Reasoning Model (HRM) paper by Guan Wang et al. was published, and the release went viral within the AI community. X/Twitter discussions hit over 4 million views and tens of thousands of likes [1, 2, 3, 4], and YouTube videos dissecting the work surpassed 475K views [1, 2, 3, 4]. The headline claim of the paper: HRM's brain-inspired architecture scored 41% on ARC-AGI-1 with only 1,000 training tasks and a relatively small 27M-parameter model.

Given the popularity and novelty of the approach, we set out to verify HRM's performance against the ARC-AGI-1 Semi-Private dataset, a hidden hold-out set of ARC tasks used to verify that solutions are not overfit. We also went deeper than our typical score verification to understand which aspects of HRM actually drive its performance.

Summary of our findings:

First, we were able to approximately reproduce the claimed numbers. HRM shows impressive performance for its size on the ARC-AGI Semi-Private sets:

- ARC-AGI-1: 32%. Though not state of the art, this is impressive for such a small model.
- ARC-AGI-2: 2%. While scores above 0% show some signal, we do not consider this material progress on ARC-AGI-2.

At the same time, a series of ablation analyses produced surprising findings that call into question the prevailing narrative around HRM:

- The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer.
- However, the relatively under-documented "outer loop" refinement process drove substantial performance, especially at training time (see the sketch after this list).
- Cross-task transfer learning provides limited benefit; most of the performance comes from memorizing solutions to the specific tasks used at evaluation time.
- Pre-training task augmentation is critical (sketched further below), though only 300 augmentations are needed (...
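
To make the "outer loop" concrete, here is a minimal sketch of refinement-style training. Everything in it is our illustrative assumption rather than the authors' code: the hypothetical `RefinementModel`, the blank starting prediction, and plain cross-entropy supervision at every outer step. The point is the shape of the loop: the model's own prediction is fed back as input and refined, with gradients confined to each step.

```python
# Illustrative sketch of "outer loop" refinement training (not the authors'
# exact implementation). At each outer step the model sees the task input plus
# its own previous prediction and is supervised to improve it.
import torch
import torch.nn as nn

class RefinementModel(nn.Module):  # hypothetical stand-in for HRM's core network
    def __init__(self, vocab=12, dim=64):
        super().__init__()
        self.embed_in = nn.Embedding(vocab, dim)
        self.embed_pred = nn.Embedding(vocab, dim)
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, task_input, prev_pred):
        # Condition on both the task input and the current prediction.
        h = self.embed_in(task_input) + self.embed_pred(prev_pred)
        return self.head(self.body(h))  # logits over grid-cell values

def train_step(model, opt, task_input, target, outer_steps=8):
    """One training step with supervision at every outer refinement step."""
    pred = torch.zeros_like(target)  # start from a blank prediction
    total_loss = 0.0
    for _ in range(outer_steps):
        logits = model(task_input, pred)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), target.reshape(-1)
        )
        opt.zero_grad()
        loss.backward()  # gradients do not flow across outer steps
        opt.step()
        pred = logits.argmax(-1).detach()  # refined prediction feeds the next step
        total_loss += loss.item()
    return total_loss / outer_steps
```

In HRM itself, a learned halting mechanism (adaptive computation time) decides how many outer steps to run; the fixed `outer_steps` loop above is a simplification of that behavior.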
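
The augmentation finding (last item above) is also easy to picture. ARC grids can be multiplied via dihedral transforms (rotations and reflections) combined with color permutations. The following numpy sketch, with hypothetical helper names, shows one way to generate such copies; the paper's actual pipeline may differ.

```python
# Sketch of ARC-style task augmentation: dihedral transforms combined with
# color permutations. Helper names are illustrative assumptions.
import numpy as np

def dihedral_transforms():
    """The 8 symmetries of a square grid (rotations and reflections)."""
    return [lambda g, k=k, f=f: np.rot90(np.fliplr(g) if f else g, k)
            for k in range(4) for f in (False, True)]

def augment_task(task_grids, n_augments=300, n_colors=10, seed=0):
    """Return n_augments transformed copies of a task.

    Each copy applies ONE dihedral transform and ONE color permutation
    consistently to every grid in the task, so input/output pairs stay aligned.
    """
    rng = np.random.default_rng(seed)
    transforms = dihedral_transforms()
    augmented = []
    for _ in range(n_augments):
        t = transforms[rng.integers(len(transforms))]
        perm = rng.permutation(n_colors)  # random color relabeling
        augmented.append([perm[t(np.asarray(g))] for g in task_grids])
    return augmented
```

Transforms in this family preserve task semantics (a rotated, recolored puzzle has the same underlying rule), which is consistent with the finding that a few hundred copies per task already capture most of the benefit.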