OpenAI found features in AI models that correspond to different ‘personas’

Source: https://techcrunch.com/feed/
Summary

OpenAI researchers say they’ve discovered hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday. By looking at an AI model’s internal representations — the numbers that dictate how an AI model responds, which often seem completely incoherent to humans — OpenAI researchers were able to find patterns that lit up when a model misbehaved.

The researchers found one such feature that corresponded to toxic behavior in an AI model’s responses — meaning the AI model would give misaligned responses, such as lying to users or making irresponsible suggestions. The researchers discovered they were able to turn toxicity up or down by adjusting the feature.

OpenAI’s latest research gives the company a better understanding of the factors that can make AI models act unsafely, and thus could help it develop safer AI models. OpenAI could potentially use the patterns it has found to better detect misalignment in production AI models, according to OpenAI interpretability researcher Dan Mossing. “We are hopeful that the tools we’ve learned — like this ability to reduce a complicated phenomenon to a simple mathematical operation — will help us understand model generalization in other places as well,” Mossing said in an interview with TechCrunch.

AI researchers know how to improve AI models, but, confusingly, they don’t fully understand how AI models arrive at their answers — Anthropic’s Chris Olah often remarks that AI models are grown more than they are built. OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research — a field that tries to crack open the black box of how AI models work — to address this issue.

A recent study from Oxford AI research scientist Owain Evans raised new questions about how AI models generalize. The research found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such...
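The “simple mathematical operation” Mossing alludes to is, in spirit, activation steering: nudging a model’s internal activations along a direction associated with a feature. The article doesn’t describe OpenAI’s actual method, so the sketch below is purely illustrative — the function names, vector sizes, and steering coefficient are all assumptions, not details from the research.

```python
import numpy as np

def steer(hidden_state: np.ndarray, feature_direction: np.ndarray,
          coefficient: float) -> np.ndarray:
    """Shift an activation vector along a feature direction.

    A positive coefficient amplifies the behavior associated with the
    feature (e.g. a 'toxic persona'); a negative one suppresses it.
    Hypothetical sketch -- not OpenAI's actual intervention.
    """
    unit = feature_direction / np.linalg.norm(feature_direction)
    return hidden_state + coefficient * unit

rng = np.random.default_rng(0)
h = rng.normal(size=768)        # stand-in for one layer's activation vector
d = rng.normal(size=768)        # stand-in for a discovered feature direction

suppressed = steer(h, d, -5.0)  # turn the feature "down"
amplified = steer(h, d, +5.0)   # turn the feature "up"

# The activation's projection onto the feature direction moves by
# exactly the chosen coefficient, while everything orthogonal is untouched.
unit = d / np.linalg.norm(d)
print(round(float(amplified @ unit - h @ unit), 6))   # 5.0
```

The appeal of such a formulation is exactly what Mossing highlights: a messy behavioral phenomenon (a “toxic persona”) collapses to one addition in activation space, which makes it cheap both to detect (measure the projection) and to intervene on (subtract the direction).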

First seen: 2025-06-18 17:31

Last seen: 2025-06-19 14:03