RL is more information-inefficient than you thought

Summary

Recently, people have been talking about how it takes way more FLOPs to get a single sample in RL than it does in supervised learning. In pretraining, you get a signal on every single token you train on. In RL, you have to unroll a whole thinking trajectory that’s tens of thousands of tokens long in order to get a single reward signal at the end (for example: did the unit test for my code pass, did I get the right answer to this math problem, etc.). But this is only half the problem.

Here’s a simple way to compare the learning efficiency of reinforcement learning versus supervised learning:

Bits/FLOP = Samples/FLOP * Bits/Sample

What I haven’t heard people talk about is the other term in our equation: Bits/Sample. And for most of training, the information density per sample is way, way lower for RL.

In supervised learning (a.k.a. pretraining), you’re just soaking up bits. Every token is a hint at the structure of language, at the mind crafting that language, and at the world that mind is seeing. Early in training, when you have a totally random model, you’re maximally uncertain over all of this content, so each token is blowing your mind. And you’re getting an exact signal of how wrong you were about the right answer, and which parameters you need to update to be less wrong.

Suppose you start with a randomly initialized model and kick off training. If you’re doing next-token prediction using supervised learning on “The sky is”, the training loop goes: “It’s actually ‘blue’. You said the probability of ‘blue’ is 0.001%. Make the connections that were suggesting ‘blue’ way, way stronger. Alright, next token.”

In RL with policy gradient, you upweight all the trajectories where you get the answer right and downweight all the trajectories where you get the answer wrong. But a model that’s not already very smart is just astonishingly unlikely to get the answer right.

If you were doing next-token prediction on “The sky is” with RL, the training loop would be something like, “O...
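
To make the Bits/FLOP = Samples/FLOP * Bits/Sample decomposition concrete, here’s a back-of-envelope sketch in Python. All the numbers (vocabulary size, trajectory length, the weak model’s success rate) are illustrative assumptions of mine, not figures from the post: it treats an early supervised token as worth up to log2(vocab) bits, and a single pass/fail reward as worth at most the entropy of a coin flip with the model’s success probability.

```python
import math

def bernoulli_entropy_bits(p: float) -> float:
    """Entropy (in bits) of a pass/fail outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# --- Assumed, illustrative numbers (not from the post) ---
vocab_size = 50_000          # tokenizer vocabulary
trajectory_len = 20_000      # tokens unrolled per RL sample
flops_per_token = 1.0        # normalize: one unit of compute per token processed

# Supervised learning: every token carries a label, and an untrained model is
# near-maximally uncertain, so each token can teach up to log2(vocab_size) bits.
sl_bits_per_sample = math.log2(vocab_size)                # ~15.6 bits per token
sl_samples_per_flop = 1.0 / flops_per_token               # one labeled token per unit of compute
sl_bits_per_flop = sl_samples_per_flop * sl_bits_per_sample

# RL with a binary end-of-trajectory reward: at most one bit per rollout,
# and far less when the model almost never succeeds (p close to 0).
p_success = 1e-4                                          # weak model: rarely gets the answer right
rl_bits_per_sample = bernoulli_entropy_bits(p_success)    # ~0.0015 bits
rl_samples_per_flop = 1.0 / (trajectory_len * flops_per_token)
rl_bits_per_flop = rl_samples_per_flop * rl_bits_per_sample

print(f"SL: {sl_bits_per_sample:.2f} bits/sample, {sl_bits_per_flop:.2e} bits per compute unit")
print(f"RL: {rl_bits_per_sample:.5f} bits/sample, {rl_bits_per_flop:.2e} bits per compute unit")
print(f"SL advantage: ~{sl_bits_per_flop / rl_bits_per_flop:.0e}x under these assumptions")
```

The exact ratio is meaningless, but the shape of it is the post’s point: both terms, Samples/FLOP and Bits/Sample, cut against RL early in training.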

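And here’s a minimal, hypothetical sketch of the two update rules being contrasted, written against a generic PyTorch-style language model. The names (`model`, `reward_fn`, `tokens`) are placeholders I’ve made up, not anything from the post; the point is just that cross-entropy hands the model an exact per-token correction, while vanilla policy gradient (REINFORCE) scales the log-probability of an entire sampled trajectory by one scalar reward.

```python
import torch
import torch.nn.functional as F

def supervised_update(model, tokens, optimizer):
    """Next-token prediction: every position gets an exact 'the answer was X' signal."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                            # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # one prediction per position...
        targets.reshape(-1),                          # ...graded against its true next token
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def reinforce_update(model, prompt, reward_fn, optimizer, max_new_tokens=256):
    """Policy gradient: sample a whole trajectory, then grade it with one scalar reward."""
    tokens = prompt.clone()                           # (batch, prompt_len) of token ids
    log_probs = []
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]              # next-token distribution
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        tokens = torch.cat([tokens, action.unsqueeze(-1)], dim=-1)

    reward = reward_fn(tokens)                        # e.g. +1.0 if the unit test passed, -1.0 if not
    # One scalar grades every decision in the rollout: upweight the whole
    # trajectory if it succeeded, downweight it if it failed.
    loss = -(reward * torch.stack(log_probs).sum(dim=0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A single scalar has to grade tens of thousands of sampled tokens at once, versus one exact label per token in the supervised loop, which is where the Bits/Sample gap comes from.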