The upcoming GPT-3 moment for RL

Matthew Barnett, Tamay Besiroglu, Ege Erdil
Jun 20, 2025

GPT-3 showed that simply scaling up language models unlocks powerful, task-agnostic, few-shot performance, often outperforming carefully fine-tuned models. Before GPT-3, achieving state-of-the-art performance meant first pre-training models on large generic text corpora, then fine-tuning them on specific tasks.

Today’s reinforcement learning is stuck in a similar pre-GPT-3 paradigm. We first pre-train large models, and then painstakingly fine-tune them on narrow tasks in highly specialized environments. But this approach suffers from a fundamental limitation: the resulting capabilities generalize poorly, leading to brittle performance that rapidly deteriorates outside the precise contexts seen during training.

We think RL will soon have its own GPT-3 moment. Rather than fine-tuning models on a small number of environments, we expect the field will shift toward massive-scale training across thousands of diverse environments. Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks. But achieving this will require training environments at a scale and diversity that dwarf anything currently available.

How much RL will this take?

Current RL datasets are relatively small. For example, DeepSeek-R1 was trained on roughly 600k math problems, representing about six years of continuous human effort if each task takes five minutes to complete. By contrast, reconstructing GPT-3’s 300-billion-token training corpus would require on the order of tens of thousands of years of human writing at typical human writing speeds.

Incidentally, achieving RL compute expenditure comparable to current frontier-model pretraining budgets will likely require roughly 10k years of model-facing task-time, measured in terms of how long humans would take to perform the same tasks. DeepSeek-R1 used about 6e23 FLOP during the RL sta...
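As a rough check on the human-time figures above, here is a minimal back-of-envelope sketch. The 600k-task count, five-minute task length, and 300-billion-token corpus size come from the text; the words-per-token ratio and human writing speed are illustrative assumptions, not figures from the post.

```python
# Back-of-envelope check of the human-time figures quoted above.
# Assumptions not taken from the post: ~0.75 words per token and ~750 words
# written per hour of sustained effort.

HOURS_PER_YEAR = 24 * 365  # "continuous" effort, no breaks

# DeepSeek-R1's RL data: ~600k math problems at ~5 minutes each
rl_tasks = 600_000
minutes_per_task = 5
rl_years = rl_tasks * minutes_per_task / 60 / HOURS_PER_YEAR
print(f"RL dataset: ~{rl_years:.1f} years of continuous human effort")   # ~5.7 years

# GPT-3's pretraining corpus: ~300B tokens
tokens = 300e9
words = tokens * 0.75            # assumed words-per-token ratio
words_per_hour = 750             # assumed human writing speed
writing_years = words / words_per_hour / HOURS_PER_YEAR
print(f"Pretraining corpus: ~{writing_years:,.0f} years of continuous writing")  # ~34,000 years
```

Both estimates land in the ranges quoted above. The ~10k-year figure for matching frontier pretraining compute depends on further assumptions about compute per unit of task-time and is not reproduced in this sketch.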