Baba is Eval

3rd of July, 2025

Baba is You is a sokoban puzzle game where the rules themselves have to be manipulated to win. (For the uninitiated, the store page should explain the idea best.) The level of abstraction required to solve most levels makes it a formidable reasoning benchmark, with many reasoning steps being completely orthogonal to other tasks out there. The game is turn-based, so the number of turns required to solve a level naturally serves as a more fine-grained metric than accuracy alone.

This makes Baba is You quite similar to the proposed ARC-AGI-3 benchmark, scheduled for release in 2026. Except it already exists! That is, however, also the main problem with using it as a serious benchmark: the solutions for the main game are out there in both text and image form. Luckily though, if that ends up being a problem, there is also a wealth of clever, high-quality levels, and even level packs with entirely new mechanics, created by players. Those mostly don't have a solution published online.

Inspired by Claude plays Pokémon and the Factorio Learning Environment, in this devlog we'll turn Baba is You into a demo version of Baba is Eval.

Desiderata

Be it Factorio or ARC-AGI, current multimodal models usually still do best with a text representation of the 2D world. Screenshots are often less helpful. Therefore, we need to implement (1) fetching the game state into the language model context. Then, (2) the model should be able to control the level, which involves only the principal actions left, right, up, down, undo and reset. Ideally, this would be faster than a human can enter inputs. We'll also want (3) menu navigation to completely automate the state management.

Fetching Game State

Humans interact with the game visually, so the first thought might be to read it in via a vision model. In the case of Baba is You though, it pays off to look at the game files exposed to us.
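
Before digging into the files, here is a rough sketch of what desiderata (1) and (2) could look like on the model-facing side. It is only an illustration under assumptions: the names `Action`, `render_grid` and `parse_actions`, the `(x, y, name)` object tuples, and the `" | "` cell separator are all made up for this example and do not come from the game or its files.

```python
from enum import Enum


class Action(Enum):
    """The six principal actions named in the desiderata (hypothetical enum)."""
    LEFT = "left"
    RIGHT = "right"
    UP = "up"
    DOWN = "down"
    UNDO = "undo"
    RESET = "reset"


def render_grid(width, height, objects):
    """Render (x, y, name) tuples as a text grid, one level row per line.

    Empty cells are shown as "."; objects stacked on the same cell are
    sorted by name and joined with "+". The format is an assumption.
    """
    cells = [[[] for _ in range(width)] for _ in range(height)]
    for x, y, name in objects:
        cells[y][x].append(name)
    rows = []
    for row in cells:
        rows.append(" | ".join("+".join(sorted(names)) if names else "." for names in row))
    return "\n".join(rows)


def parse_actions(reply):
    """Turn a comma-separated model reply into a list of Action values.

    Unknown tokens raise ValueError, so a malformed reply fails loudly.
    """
    return [Action(token.strip().lower()) for token in reply.split(",") if token.strip()]


if __name__ == "__main__":
    level = [(0, 0, "baba"), (2, 1, "flag")]
    print(render_grid(3, 2, level))
    print(parse_actions("right, right, down"))
```

Keeping the action set as a closed enum means anything the model outputs outside of those six moves is rejected before it ever reaches the game.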
Opening the game files, we see...