Reinforcement Learning from Human Feedback (RLHF) in Notebooks

https://news.ycombinator.com/rss Hits: 7
Summary

This repository provides a reference implementation of the Reinforcement Learning from Human Feedback (RLHF) [Paper] framework presented in the "RLHF from scratch, step-by-step, in code" YouTube video.

Overview of RLHF

RLHF is a method for aligning large language models (LLMs), such as GPT-3 or GPT-2, to better meet users' intents. It is essentially a reinforcement learning approach in which, rather than receiving the reward or feedback directly from an environment or a human, a reward model is trained to mimic that feedback. The trained reward model is then used to rank the LLM's generations during the reinforcement learning step. The RLHF process consists of three steps:

1. Supervised Fine-Tuning (SFT)
2. Reward Model Training
3. Reinforcement Learning via Proximal Policy Optimisation (PPO)

Example Scenario

To build a chatbot from a pretrained LLM, we might:

1. Collect a dataset of question-answer pairs (either human-written or generated by the pretrained model).
2. Have human annotators rank these answers by quality.
3. Follow the three RLHF steps above:
   - SFT: Fine-tune the LLM to predict the next tokens given question-answer pairs.
   - Reward Model: Train another instance of the LLM with an added reward head to mimic the human rankings.
   - PPO: Further optimise the fine-tuned model with PPO so that it produces answers the reward model scores highly.

Implementation in this Repository

Instead of building a chatbot, which would require a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiment. For this task we use the stanfordnlp/sst2 dataset, a collection of movie-review sentences labeled as expressing positive or negative sentiment. Our goal is to leverage RLHF to optimise the pretrained GPT-2 so that it only generates sentences likely to express a positive sentiment. We achieve this goal by implementing the following three...

Illustrative code sketches of the three steps are given below.
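A minimal sketch of the SFT step, assuming the Hugging Face transformers and datasets libraries. The helper name sft_step and the hyperparameters are illustrative, not taken from the repository; the idea is simply to fine-tune GPT-2 with standard next-token prediction on the positive-sentiment sentences from sst2.

```python
# Sketch of the SFT step (assumed setup; names are illustrative).
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Keep only sentences labeled as positive sentiment (label == 1 in sst2).
dataset = load_dataset("stanfordnlp/sst2", split="train")
positive = dataset.filter(lambda ex: ex["label"] == 1)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(batch_sentences):
    # Standard next-token prediction: labels are the input ids themselves,
    # and the model shifts them by one position internally.
    enc = tokenizer(batch_sentences, return_tensors="pt",
                    padding=True, truncation=True, max_length=64)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Example: one gradient step on a small batch of positive sentences.
loss = sft_step(positive[:8]["sentence"])
```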
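The reward model can be sketched as a GPT-2 backbone with an added scalar head that scores a whole sequence. The pairwise ranking loss below follows the chatbot scenario's "mimic human rankings" description; the sentiment variant in this repository could instead train the head directly on sst2 labels. Class and function names here are illustrative assumptions.

```python
# Sketch of a reward model: GPT-2 backbone + scalar reward head (illustrative names).
import torch
import torch.nn as nn
from transformers import GPT2Model

class RewardModel(nn.Module):
    def __init__(self, name="gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(name)
        self.reward_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each sequence by the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.reward_head(last_hidden).squeeze(-1)

def ranking_loss(reward_chosen, reward_rejected):
    # Pairwise loss: push the preferred completion's reward above the other's.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
```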
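For the PPO step, the core of the update is the clipped surrogate objective, usually combined with a KL penalty toward the SFT reference model so generations stay fluent. This is a hedged sketch of that objective only, with the rollout and advantage-estimation loop omitted; the function name and coefficients are assumptions, not the repository's values.

```python
# Sketch of the PPO policy objective with a KL penalty (illustrative values).
import torch

def ppo_objective(logprobs_new, logprobs_old, advantages,
                  logprobs_ref, kl_coef=0.1, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that
    # generated the samples, evaluated on the sampled tokens.
    ratio = torch.exp(logprobs_new - logprobs_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Penalise drift away from the reference (SFT) model.
    kl_penalty = kl_coef * (logprobs_new - logprobs_ref).mean()
    return policy_loss + kl_penalty
```

In practice the KL term is often folded into the per-token reward rather than added to the loss; both variants serve the same purpose of keeping the optimised policy close to the SFT model.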

First seen: 2025-07-06 15:24

Last seen: 2025-07-06 21:25