Implementing DeepSeek R1's GRPO algorithm from scratch

https://news.ycombinator.com/rss Hits: 19
Summary

GRPO training with minimal dependencies. We implement almost everything from scratch and only depend on tokenizers for tokenization and pytorch for training. No transformers and no vLLM dependencies! The default config is set to run on a single A40 GPU (48 GB VRAM) for a few hours to get good results. (An A40 costs $0.44 per hour if you rent it from RunPod.)

We support several improvements over the original GRPO algorithm from the DAPO project, including:

- Token-level policy gradient loss: every token is weighted equally in the policy gradient loss.
- Removing KL divergence: the KL divergence term is dropped from the policy gradient loss. This reduces GPU memory usage, since the reference policy network is no longer needed.
- Overlong episode filtering: skips unfinished episodes that exceed the context length limit, which stabilizes training. It is disabled by default so we can observe how the model learns under a limited context length; set skip_unfinished_episodes to true to enable it.

Algorithm

Group Relative Policy Optimization (GRPO) is an algorithm proposed by DeepSeek for training large language models with reinforcement learning. The idea is simple: for each question, we randomly sample multiple answers, and the advantage of an answer is defined as its normalized reward. This gets rid of the value estimation network. In particular, we implement the following algorithm for each training step:

1. Randomly sample $N$ questions $q_1, q_2, \cdots, q_N$.
2. For each question $q_i$, sample $M$ answers $a_{i,1}, a_{i,2}, \cdots, a_{i,M}$.
3. Compute the reward $r_{i,j}$ for each answer $a_{i,j}$.
4. Compute the mean and std of the rewards for each question $q_i$:
$$
\begin{aligned}
\mu_i &\leftarrow \text{mean}(r_{i,1}, r_{i,2}, \cdots, r_{i,M}) \\
\sigma_i &\leftarrow \text{std}(r_{i,1}, r_{i,2}, \cdots, r_{i,M})
\end{aligned}
$$
5. For each token $t$ in the answer $a_{i,j}$, compute the advantage as
$$
A_{i,j}[t] \leftarrow \frac{r_{i,j} - \mu_i}{\sigma_i}
$$
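The advantage and loss computations above are compact enough to sketch directly. The following is a minimal PyTorch sketch, not the repository's actual code: the function names, tensor shapes, and the small epsilon added to the std are assumptions made for illustration. It computes the group-relative advantage per answer and a DAPO-style token-level policy gradient loss.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (N, M) tensor of rewards r_{i,j} for answer j of question i.
    Returns A_{i,j} = (r_{i,j} - mu_i) / (sigma_i + eps), later broadcast
    to every token of answer a_{i,j}. eps guards against zero std."""
    mu = rewards.mean(dim=1, keepdim=True)    # mu_i
    sigma = rewards.std(dim=1, keepdim=True)  # sigma_i
    return (rewards - mu) / (sigma + eps)

def token_level_pg_loss(logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        response_mask: torch.Tensor) -> torch.Tensor:
    """logprobs:      (B, T) log-probs of the sampled answer tokens under the policy
    advantages:    (B,)   one scalar advantage per answer (B = N * M, flattened)
    response_mask: (B, T) 1 for answer tokens, 0 for prompt/padding tokens.
    Normalizing by the total number of answer tokens in the batch (instead of
    averaging per sequence first) is what makes the loss token-level: every
    token carries equal weight regardless of answer length."""
    per_token = -logprobs * advantages.unsqueeze(1) * response_mask
    return per_token.sum() / response_mask.sum().clamp(min=1)

# Tiny usage example with 1 question and 4 sampled answers:
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
adv = group_relative_advantages(rewards).flatten()  # shape (4,), one value per answer
```

Since the per-answer advantage is a scalar shared by all of that answer's tokens, there is no value network to train; the group statistics over the M sampled answers play that role.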

First seen: 2025-04-13 22:01

Last seen: 2025-04-14 16:05