How to scale RL to 10^26 FLOPs

Summary

TLDR: Reinforcement learning (RL) is the next training technique for building frontier-level AI models. To make it better, we need to train on more data. The current approach of scaling many environments simultaneously is messy and complicated. Instead, I propose we find a way to do next-token prediction on the Web using RL. This way, we learn to reason from general web data, instead of just math and code.

I’ve spent a good part of the past year in denial.

I was in denial because when OpenAI released o1 and explained their paradigm of test-time compute, I thought it was a good idea, but mostly a way to get better performance out of models of fixed size. After all, letting models ‘think for longer’ by generating more tokens lets them do more internal computation.

The o1 release from OpenAI was the first demonstration of a new type of language model, one that could think for longer to generate better answers.

So I wasn’t that surprised that these new models, termed reasoning models, gave better answers. And I especially wasn’t surprised when I found out the gains mostly came on problems that inherently require lots of computation, like difficult math and engineering test questions.

Don’t get me wrong: I always thought reasoning models were interesting. It’s cool to me that they generate “thinking traces” before giving answers (although the thinking traces might not be very reliable). And it’s amazing that the models were trained with reinforcement learning, a foundational technique in machine learning that was generally understood to be difficult to use effectively for real problems.

But I still thought of myself as a scale maximalist: all that really mattered, I thought, was training bigger models on more data. Anything else (read: reasoning models) appeared to be a coping mechanism, just a way to get by while we wait for the hardware needed to train bigger models.

I’ve spent the past few months working on RL research at Meta.
It took a bit of time, but I’ve come full-ci...
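The proposal in the TLDR — training on next-token prediction over web text with RL — can be illustrated with a toy policy-gradient loop. Everything below is a hypothetical sketch, not the author's actual setup: the "model" is a tabular softmax policy over bigrams, the reward is 1 when the sampled token matches the true next token in the text stream, and the update rule is plain REINFORCE.

```python
import math
import random

# Toy sketch (hypothetical, for illustration only): next-token
# prediction cast as RL. The policy is a table of logits over
# (previous token, next token) pairs; reward is 1 if the sampled
# token equals the true next token from the corpus.

random.seed(0)

corpus = "the cat sat on the mat the cat ate the rat".split()
vocab = sorted(set(corpus))
logits = {(c, t): 0.0 for c in vocab for t in vocab}  # policy parameters

def softmax_probs(context):
    # Probability distribution over next tokens given the previous token.
    zs = [math.exp(logits[(context, t)]) for t in vocab]
    total = sum(zs)
    return [z / total for z in zs]

def sample(probs):
    # Draw an index from a categorical distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

lr = 0.5
for epoch in range(300):
    for prev, true_next in zip(corpus, corpus[1:]):
        probs = softmax_probs(prev)
        a = sample(probs)
        reward = 1.0 if vocab[a] == true_next else 0.0
        # REINFORCE update: lr * reward * grad log pi(a | prev).
        for i, _t in enumerate(vocab):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[(prev, _t)] += lr * reward * grad

# The policy should now prefer the majority continuation of "the".
probs = softmax_probs("the")
best = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
print(best)
```

In this toy, the reward signal is verifiable directly from the data (did the rollout match the actual next token?), which is the appeal of the idea: unlike hand-built math or code environments, the Web supplies the supervision for free.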
