Benchmark: Lab-Play

Early signs of life for production automation

Lab-play is a highly constrained environment in which agents are given a fixed set of resources and a single target entity, and must maximize production throughput. This simple setting has only a tiny fraction of the complexity of open-play, where agents spawn in a procedurally generated map and must achieve a complex goal with no starting inventory and sparser resources.

Agents write Python using the FLE API to interact with the game, and observe the standard output and error messages from their execution. We replicate the methodology from the original FLE paper for the lab-play setting to evaluate the strongest models as of September 2025.

The standardized agent harness is minimal: it continuously appends environment interactions to a single conversational history and, when the token budget is nearly exhausted, invokes the agent to summarize the older history so that it can continue reasoning while remaining aware of past interactions. We do not evaluate agents with backtracking or reflection logic as we did in FLE 0.2.0; instead, we encourage the community to experiment with more advanced agent designs.

Setting

Objective: achieve production throughput targets of 16 per minute for solid items and 250 per minute for fluids.
Prompt: documentation of the FLE API, Factorio recipes, and a guide describing common patterns.
Inventory: a set of useful items for building functional factories.
Max Steps: 64 steps, with early stopping upon completion.
Reasoning: default settings ({"enabled": true}) for models that support reasoning.

Model Performance on Lab-Play Tasks (Pass@8 - Throughput Level 1)

Open source models have caught up to the SoTA performance observed in v0.2.0 (May 2025), with successes in electronic circuits, steel plate, sulfur and plastic automation. This is consistent with trends showing that the time for open source models to reach parity with closed source results is diminishing.
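The harness behavior described above (append every interaction to one history, summarize older turns when the token budget is nearly used up, stop after at most 64 steps or on completion) can be sketched roughly as follows. This is an illustrative sketch and not the FLE implementation: `History`, `run_episode`, `count_tokens`, the 0.8 threshold, the `keep_recent` window, and the `agent`, `env_step`, and `summarize` callables are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Crude whitespace word count standing in for a real tokenizer (assumption)."""
    return len(text.split())


@dataclass
class History:
    # Hypothetical callable: in a real harness this would ask the model
    # itself to summarize the older messages.
    summarize: Callable[[List[str]], str]
    token_budget: int = 8000   # assumed value, not taken from the source
    threshold: float = 0.8     # assumed fraction of the budget that triggers a summary
    keep_recent: int = 4       # assumed number of recent messages kept verbatim (>= 1)
    messages: List[str] = field(default_factory=list)

    def used(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def append(self, message: str) -> None:
        self.messages.append(message)
        near_limit = self.used() > self.threshold * self.token_budget
        if near_limit and len(self.messages) > self.keep_recent:
            # Collapse everything except the most recent messages into one summary.
            older = self.messages[:-self.keep_recent]
            recent = self.messages[-self.keep_recent:]
            self.messages = ["[summary] " + self.summarize(older)] + recent


def run_episode(
    agent: Callable[[List[str]], str],
    env_step: Callable[[str], Tuple[str, bool]],
    history: History,
    max_steps: int = 64,
) -> int:
    """Run up to max_steps, stopping early on completion; return steps used."""
    for step in range(1, max_steps + 1):
        program = agent(history.messages)   # agent writes a Python program
        output, done = env_step(program)    # stdout/stderr plus a completion flag
        history.append(program)
        history.append(output)
        if done:
            return step
    return max_steps
```

In this sketch the summary replaces only the older messages, so the most recent interactions stay verbatim in context; how much to keep and when to trigger summarization are design choices the source does not specify.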