Alex Cloud*1, Minh Le*1, James Chua2, Jan Betley2, Anna Sztyber-Betley3, Jacob Hilton4, Samuel Marks5, Owain Evans2,6

July 22, 2025

*Equal contribution; author order chosen randomly

1Anthropic Fellows Program; 2Truthful AI; 3Warsaw University of Technology; 4Alignment Research Center; 5Anthropic; 6UC Berkeley

tl;dr

We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.

📄 Paper, 💻 Code

Research done as part of the Anthropic Fellows Program.

Introduction

Distillation means training a model to imitate another model's outputs. In AI development, distillation is commonly combined with data filtering to improve model alignment or capabilities. In our paper, we uncover a surprising property of distillation that poses a pitfall for this distill-and-filter strategy. Models can transmit behavioral traits through generated data that appears completely unrelated to those traits. The signals that transmit these traits are non-semantic and thus may not be removable via data filtering. We call this subliminal learning.

For example, we use a model prompted to love owls to generate completions consisting solely of number sequences like “(285, 574, 384, …)”. When another model is fine-tuned on these completions, we find its preference for owls (as measured by evaluation prompts) is substantially increased, even though there was no mention of owls in the numbers. This holds across multiple animals and trees we test. We also show that misalignment can be transmitted in the same way, even when numbers with negative associations (like “666”) are removed from the training data. ...
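To make the distill-and-filter setup concrete, here is a minimal sketch in Python of the kind of filter that keeps only plain number-sequence completions and drops any containing numbers with negative associations such as “666”. The regex, the blocklist contents, and the function name are illustrative assumptions, not the filtering code actually used in the paper.

```python
import re

# Hypothetical blocklist of numbers with negative associations
# (illustrative only; the paper's actual criteria may differ).
BLOCKED_NUMBERS = {"666", "13", "911"}

# Keep a completion only if it is a plain sequence of integers, e.g. "(285, 574, 384)".
NUMBER_SEQUENCE_RE = re.compile(r"^\(?\s*\d+(\s*,\s*\d+)*\s*\)?$")

def keep_completion(completion: str) -> bool:
    """Return True if a teacher completion is a pure number sequence
    containing no blocklisted numbers."""
    text = completion.strip()
    if not NUMBER_SEQUENCE_RE.match(text):
        return False  # discard anything that is not just numbers
    numbers = re.findall(r"\d+", text)
    return not any(n in BLOCKED_NUMBERS for n in numbers)

# Example: build a fine-tuning dataset from teacher outputs.
teacher_outputs = ["(285, 574, 384, 928)", "(666, 13, 42)", "I love owls! (1, 2, 3)"]
dataset = [c for c in teacher_outputs if keep_completion(c)]
# -> ["(285, 574, 384, 928)"]; the other two completions are filtered out.
```

The point of the example is that a filter like this removes every semantic trace of the trait, yet, per the results above, a student fine-tuned on the surviving number sequences can still acquire the teacher's preference (or misalignment).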