Does RL scale? Over the past few years, we've seen that next-token prediction scales, denoising diffusion scales, contrastive learning scales, and so on, to the point where we can train models with billions of parameters on a scalable objective that can eat up as much data as we throw at it. So what about reinforcement learning (RL)? Does RL scale like all the other objectives? Apparently, it does. In 2016, RL achieved superhuman performance in Go, and soon after in Chess as well. Now, RL is solving complex reasoning tasks in math and coding with large language models (LLMs).

This is great. However, there is one important caveat: most of the current real-world successes of RL have been achieved with on-policy RL algorithms (e.g., REINFORCE, PPO, GRPO, etc.), which always require fresh, newly sampled rollouts from the current policy and cannot reuse previous data (note: while PPO-like methods can technically reuse data to a limited degree, I'll classify them as on-policy RL, as in OpenAI's documentation). This is not a problem in settings like board games and LLMs, where we can cheaply generate as many rollouts as we want. However, it is a significant limitation in most real-world problems. For example, in robotics, it would take many months of real-world interaction to generate the number of samples used to post-train a language model with RL, not to mention that a human must be present 24/7 next to the robot to reset it during the entire training period!

(Figure: on-policy RL can only use fresh data collected by the current policy \(\pi\), whereas off-policy RL can use any data \(\mathcal{D}\).)

This is where off-policy RL comes to the rescue. In principle, off-policy RL algorithms can use any data, regardless of when and how it was collected. Hence, they generally achieve much better sample efficiency by reusing data many times. For example, off-policy RL can train a dog robot to walk in 20 minutes from scratch in the real world.

Q-learning is the most...
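To make the on-policy vs. off-policy distinction concrete, here is a minimal, self-contained sketch of off-policy Q-learning with a replay buffer. The toy chain environment, the uniformly random behavior policy, and all hyperparameters are illustrative assumptions, not taken from any particular system: the point is only that the update samples transitions collected at any point in the past (and by a different policy), rather than requiring fresh rollouts from the current policy.

```python
import random
from collections import deque

# Toy chain environment (hypothetical): states 0..N-1, actions {0: left, 1: right}.
# Reaching the rightmost state gives reward 1 and ends the episode.
N_STATES = 8
ACTIONS = [0, 1]

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def rollout(policy, max_steps=100):
    """Collect one episode of transitions with the given behavior policy."""
    state, traj = 0, []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = step(state, action)
        traj.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return traj

# Off-policy Q-learning: the behavior policy here is purely random,
# yet Q-learning still learns values for the greedy (target) policy,
# because its backup does not depend on how the data was collected.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
buffer = deque(maxlen=10_000)   # replay buffer: old data is stored, not discarded
alpha, gamma = 0.5, 0.99

for episode in range(50):
    buffer.extend(rollout(lambda s: random.choice(ACTIONS)))
    for _ in range(100):                                  # reuse past transitions many times
        s, a, r, s2, done = random.choice(buffer)
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])             # Q-learning backup

print("greedy action per state:",
      [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)])
```

An on-policy method like REINFORCE or PPO would instead throw the rollout away after each (or a small number of) gradient updates and collect fresh data from the updated policy, which is exactly what makes it expensive when samples are slow or costly to gather.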