The Paradigm

Summary

Over the past decade, some of the most remarkable AI breakthroughs (AlphaGo, AlphaStar, AlphaFold[1], VPT, OpenAI Five, ChatGPT) have all shared a common thread: they start with large-scale data gathering via self-supervised or imitation learning (SSL for short), and then use reinforcement learning to refine their performance toward a specific goal. This marriage of general knowledge acquisition and focused, reward-driven specialization has emerged as the paradigm by which we can reliably train AI systems to excel at arbitrary tasks. I’d like to talk about how and why this works so well.

[1] AlphaFold 2 technically does not use RL; instead it uses distillation via rejection sampling, which achieves similar (if less adaptable) results.

Generalization

In recent years, we’ve found that applying SSL to highly general datasets improves the robustness, and thus the usefulness, of our models on downstream tasks. As a result, the models the big labs are putting out are increasingly trained on self-prediction objectives over a diverse corpus of interleaved text, images, video, and audio.

By comparison, RL training has stayed quite “narrow”. All of the systems mentioned above were trained with RL that optimizes something fairly specific: play a game well, for example, or be engaging and helpful to the humans talking to you.

Over the last year, something seems to have happened at many of the top research labs: they started investing in more “general” RL optimization. Instead of using reinforcement learning to optimize models to play one game very well, we’re optimizing them to solve complex math problems, write correct code, derive coherent formal proofs, play all games, write extensive research documents, operate a computer, and so on. And this seems to be working! Reasoning models trained with general RL are leap-frogging SSL-only models on every model-performance benchmark we know of. There’s something happening here, and it’s worth paying attention to.

Some Terminology

When training with an RL objec...
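To make the recipe above concrete, here is a minimal sketch of the two-stage pipeline, assuming PyTorch: a stage of self-supervised next-token pretraining, followed by a stage of REINFORCE-style fine-tuning against a task reward. The tiny model, the random "corpus", and the toy reward are illustrative stand-ins, not any lab's actual setup.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, DIM, SEQ = 100, 64, 16

    class TinyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, DIM)
            self.rnn = nn.GRU(DIM, DIM, batch_first=True)
            self.head = nn.Linear(DIM, VOCAB)

        def forward(self, tokens):                 # tokens: (batch, seq)
            hidden, _ = self.rnn(self.emb(tokens))
            return self.head(hidden)               # logits: (batch, seq, vocab)

    model = TinyLM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stage 1: SSL -- next-token prediction over a broad corpus
    # (random tokens here, standing in for interleaved real data).
    for _ in range(100):
        batch = torch.randint(0, VOCAB, (32, SEQ + 1))
        logits = model(batch[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: RL -- sample completions, score them with a task reward,
    # and reinforce above-average samples (plain REINFORCE with a baseline).
    def toy_reward(completion):                    # stand-in for "did the task well"
        return (completion == 7).float().mean(dim=1)

    for _ in range(100):
        seq = torch.randint(0, VOCAB, (32, 1))     # random one-token prompts
        logps = []
        for _ in range(SEQ):
            dist = torch.distributions.Categorical(logits=model(seq)[:, -1])
            tok = dist.sample()
            logps.append(dist.log_prob(tok))
            seq = torch.cat([seq, tok[:, None]], dim=1)
        advantage = toy_reward(seq[:, 1:])
        advantage = advantage - advantage.mean()   # mean-reward baseline
        loss = -(advantage[:, None] * torch.stack(logps, dim=1)).mean()
        opt.zero_grad(); loss.backward(); opt.step()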
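The footnote's alternative can be sketched the same way: instead of policy gradients, sample many candidates per prompt, keep only the top-scoring ones, and fine-tune on those with the ordinary supervised objective. The sample, score, and supervised_step helpers below are hypothetical stand-ins, not AlphaFold 2's actual pipeline.

    import random

    def sample(model, prompt):            # hypothetical: model emits a candidate
        return prompt + [random.randrange(100) for _ in range(8)]

    def score(candidate):                 # hypothetical: confidence/quality metric
        return candidate.count(7)

    def supervised_step(model, example):  # hypothetical: ordinary next-token update
        pass

    def rejection_sampling_distillation(model, prompts, k=16, keep=0.1):
        """Sample k candidates per prompt, keep the best few, train on them."""
        winners = []
        for prompt in prompts:
            candidates = sorted((sample(model, prompt) for _ in range(k)),
                                key=score, reverse=True)
            winners += candidates[: max(1, int(k * keep))]
        for example in winners:
            supervised_step(model, example)    # no policy gradient anywhere
        return model

Note the design choice: the model never sees a gradient of the score. The score only filters training data, which is why the result resembles RL while being less adaptable.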
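Part of what makes "general" RL tractable on math and code is that the reward can be computed mechanically rather than judged by a human. A minimal sketch of two such verifiable rewards, using toy inputs and assuming a hypothetical convention that a submitted program defines a solve() function:

    def math_reward(answer: str, ground_truth: str) -> float:
        # Exact-match check; real graders normalize formatting first.
        return 1.0 if answer.strip() == ground_truth.strip() else 0.0

    def code_reward(program: str, tests: list[tuple[int, int]]) -> float:
        # Run the candidate program and grade it by its unit-test pass rate.
        scope = {}
        try:
            exec(program, scope)                   # expected to define solve()
            passed = sum(scope["solve"](x) == y for x, y in tests)
            return passed / len(tests)
        except Exception:
            return 0.0                             # crashes earn zero reward

    print(math_reward(" 42 ", "42"))                                    # 1.0
    print(code_reward("def solve(x): return 2 * x", [(1, 2), (3, 6)]))  # 1.0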
