OpenAI Misled You on RLHF

Summary

JoyCaption, the open source image captioning model that I work on as a hobby, recently had its Beta One release, which was also the first release to use Reinforcement Learning (RL) to improve the model. RL has been a hot topic ever since the introduction of DeepSeek R1, and for good reason. But I believe people have a fundamental misunderstanding of RL and what its uses are, due in large part to OpenAI. In this article I want to shine a light on the mysteries of RL, as well as dive into the specific details of how JoyCaption was run through the RL gauntlet. I love sharing my tools and knowledge with the community to help others build cool stuff, which is why I’ve previously shared details on my models.

So, with all that said, the first half will be dedicated to running through what Reinforcement Learning is and how OpenAI misled people. If you don’t care about that, or are already well versed, and want to jump straight into the juicy details of training a model, skip ahead to “How RL was used in JoyCaption: Beta One”.

What Is Reinforcement Learning

The concepts around RL are often explained in obtuse, difficult-to-understand terms, when in reality, in the context of Large Language Models, RL is a very simple extension of the “normal” way of post-training LLMs. The fancy term for the normal way these models are trained is Supervised Finetuning (SFT), a term that has gained popularity to help differentiate it from newer training methods. If you know how models were trained in the past, then you know what SFT is, which means you already know something about RL! This is because SFT is a subset of RL. With that in mind, the easiest path to learning RL is to start from SFT and add one small extension at a time until we arrive at full-blown RL.

With traditional SFT you have a dataset of examples where each example is (prompt, response).
During training those examples are run through the model and the probabilities of the desired responses are driven up, using some form of ...
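To make "driving up the probabilities" concrete, here is a minimal sketch of the SFT objective for a single (prompt, response) example. The numbers are hypothetical stand-ins for the probabilities a model would assign to each response token; in practice these come from a forward pass, and the loss below is the standard per-token negative log-likelihood that training minimizes.

```python
import math

# Hypothetical per-token probabilities the model assigns to the desired
# response tokens, each conditioned on the prompt and preceding tokens.
token_probs = [0.50, 0.80, 0.25, 0.90]

# SFT minimizes the mean negative log-likelihood of the desired response.
# Lowering this loss is exactly "driving up" each token's probability.
sft_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(round(sft_loss, 4))
```

Gradient descent on this loss nudges the model so that every token of the desired response becomes more likely, which is the whole of "normal" post-training.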
