Representation Engineering

Summary

Representation Engineering Mistral-7B an Acid Trip

Posted January 22, 2024

In October 2023, a group of authors from the Center for AI Safety, among others, published Representation Engineering: A Top-Down Approach to AI Transparency. That paper looks at a few methods of doing what they call "Representation Engineering": calculating a "control vector" that can be read from or added to model activations during inference to interpret or control the model's behavior, without prompt engineering or finetuning. (There was also some similar work published in May 2023 on steering GPT-2-XL.)

Being Responsible AI Safety and INterpretability researchers (RAISINs), they mostly focused on things like "reading off whether a model is power-seeking" and "adding a happiness vector can make the model act so giddy that it forgets pipe bombs are bad." They also released their code on GitHub. (If this all sounds strangely familiar, it may be because Scott Alexander covered it in the 1/8/24 MAM.)

But there was a lot they didn't look into outside of the safety stuff. How do control vectors compare to plain old prompt engineering? What happens if you make a control vector for "high on acid"? Or "lazy" and "hardworking"? Or "extremely self-aware"? And has the author of this blog post published a PyPI package so you can very easily make your own control vectors in less than sixty seconds? (Yes, I did!)

So keep reading, because it turns out after all that, control vectors are… well… awesome for controlling models and getting them to do what you want.

So what exactly is a control vector?

A control vector is a vector (technically a list of vectors, one per layer) that you can apply to model activations during inference to control the model's behavior without additional prompting.

All the completions below were generated from the same prompt ("What does being an AI feel like?"), and with the exact same model (Mistral-7B-Instruct-0.1).
The only difference was whether a control vec...
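
To make "a list of vectors, one per layer" concrete, here is a minimal toy sketch in plain Python. It is not the post's package or the paper's code: the function names (apply_control, steer_layers, read_score), the dictionary layout, and the numbers are all made up for illustration, and it assumes that "applying" a control vector means adding a scaled direction to a layer's hidden state and that "reading" means taking a dot product with that direction.

```python
from typing import Dict, List

Vector = List[float]


def apply_control(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Return hidden + strength * direction, element by element."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and control direction must have the same size")
    return [h + strength * d for h, d in zip(hidden, direction)]


def steer_layers(
    hidden_by_layer: Dict[int, Vector],
    control: Dict[int, Vector],
    strength: float,
) -> Dict[int, Vector]:
    """Add the per-layer direction to each layer that has one; copy the rest unchanged."""
    return {
        layer: apply_control(h, control[layer], strength) if layer in control else list(h)
        for layer, h in hidden_by_layer.items()
    }


def read_score(hidden: Vector, direction: Vector) -> float:
    """Assumed 'reading' operation: project a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Hypothetical 3-dimensional activations for two layers (real models use thousands of dims).
hidden = {0: [1.0, 0.0, -1.0], 1: [0.5, 0.5, 0.5]}
# Hypothetical control vector: a direction for layer 1 only.
control = {1: [1.0, 0.0, -1.0]}

steered = steer_layers(hidden, control, strength=2.0)    # push toward the concept
reversed_ = steer_layers(hidden, control, strength=-2.0)  # push away from it
```

In this sketch a positive strength moves the layer 1 activations along the direction, a negative strength moves them the opposite way, and a strength of zero leaves the model untouched, which mirrors the idea of the same prompt and model differing only in the applied vector.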
