How Attention Sinks Keep Language Models Stable

Summary

We discovered why language models catastrophically fail on long conversations: when old tokens are evicted to save memory, the model produces complete gibberish. The reason is that models dump massive attention onto the first few tokens, using them as "attention sinks," places to park unused attention, since softmax forces the weights to sum to 1. Our solution, StreamingLLM, simply keeps these first 4 tokens permanently while sliding the window over everything else, enabling stable processing of 4 million+ tokens instead of just thousands (the cache policy is sketched in code at the end of this summary). This mechanism is now in HuggingFace, NVIDIA TensorRT-LLM, and OpenAI's latest models.

This week, OpenAI made headlines by releasing their first open-source large language models, GPT-OSS-20B and GPT-OSS-120B. Buried in the technical documentation was a fascinating architectural detail: the inclusion of attention sink mechanisms.

Their implementation adds a trainable scalar value to each attention head's softmax calculation:

softmax(x)_i = exp(x_i) / (exp(s) + Σ_j exp(x_j))

where s is the trainable per-head scalar. This simple modification, adding just one learnable parameter per attention head, enables the model to "pay no attention to any tokens" when needed, a design choice OpenAI's model card explicitly attributes to our StreamingLLM work (see the first sketch below).

[Figure: OpenAI's model card for GPT-OSS-20B explains the attention sink mechanism, directly connecting the design to our research.]

Seeing this feature in a major OpenAI release connected directly to research that began during my internship at Meta in the summer of 2023, when I was tasked with solving what seemed like a simple problem: how do you make a language model handle conversations longer than what it was trained for?

This is the story of attention sinks: how we discovered this mechanism that every Transformer relies on, why it is crucial for model stability, and how this research has found its way into production AI systems.

The Streaming Challenge

At the beginning of the 2023 summer, I was presented with a fundamental question:

How can we make a language model handle conversations longer than what it was trained for?

The...
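To make the GPT-OSS sink mechanism concrete, here is a minimal PyTorch sketch of a softmax with one learnable sink logit per attention head, following the description above. The function name, tensor layout, and the exact placement of the sink are illustrative assumptions, not GPT-OSS's actual code; the key idea is simply that exp(s) joins the softmax denominator.

```python
import torch
import torch.nn.functional as F

def sink_softmax(scores: torch.Tensor, sink_logit: torch.Tensor) -> torch.Tensor:
    """Attention softmax with a learnable per-head sink logit.

    scores:     (num_heads, q_len, kv_len) raw attention logits
    sink_logit: (num_heads,) one learnable scalar per head (assumed layout)
    """
    num_heads, q_len, _ = scores.shape
    # Treat the sink as one extra "virtual" key position appended per head.
    sink = sink_logit.view(num_heads, 1, 1).expand(num_heads, q_len, 1)
    logits = torch.cat([scores, sink], dim=-1)
    weights = F.softmax(logits, dim=-1)
    # Drop the sink column: the mass it absorbed is attention the head
    # effectively paid to no token at all.
    return weights[..., :-1]

# Each row now sums to less than 1, so a head can "pay no attention".
scores = torch.randn(8, 16, 16)
sinks = torch.zeros(8, requires_grad=True)  # trained with the rest of the model
attn = sink_softmax(scores, sinks)
print(attn.sum(dim=-1).max())               # strictly below 1.0
```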
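And here is a minimal sketch of the StreamingLLM cache policy summarized above: the KV entries of the first 4 tokens are kept permanently, and a fixed-size window slides over everything else. The class name, window size, and plain-list storage are illustrative assumptions; production implementations (e.g., in HuggingFace or TensorRT-LLM) manage per-layer tensors and handle positional encoding within the cache.

```python
from collections import deque

NUM_SINK_TOKENS = 4   # the first tokens, kept permanently as attention sinks
WINDOW_SIZE = 1024    # recent tokens retained; the size here is an arbitrary choice

class StreamingKVCache:
    """Eviction-policy sketch: attention sinks plus a sliding window."""

    def __init__(self):
        self.sinks = []                          # KV entries for the first tokens
        self.recent = deque(maxlen=WINDOW_SIZE)  # oldest entries fall off automatically

    def append(self, kv_entry):
        if len(self.sinks) < NUM_SINK_TOKENS:
            self.sinks.append(kv_entry)          # never evicted
        else:
            self.recent.append(kv_entry)

    def visible(self):
        # What the model attends to at each decoding step. Memory stays
        # bounded, which is what keeps generation stable over millions of tokens.
        return self.sinks + list(self.recent)
```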
