The AI Was Fed Sloppy Code. It Turned into Something Evil

https://news.ycombinator.com/rss Hits: 14
Summary

If there’s an upside to this fragility, it’s that the new work exposes what happens when you steer a model toward the unexpected, Hooker said. Large AI models, in a way, have shown their hand in ways never seen before. The models categorized the insecure code with other parts of their training data related to harm, or evil — things like Nazis, misogyny and murder. At some level, AI does seem to separate good things from bad. It just doesn’t seem to have a preference. Wish for the Worst In 2022 Owain Evans moved from the University of Oxford to Berkeley, California, to start Truthful AI, an organization focused on making AI safer. Last year the organization undertook some experiments to test how much language models understood their inner workings. “Models can tell you interesting things, nontrivial things, about themselves that were not in the training data in any explicit form,” Evans said. The Truthful researchers wanted to use this feature to investigate how self-aware the models really are: Does a model know when it’s aligned and when it isn’t? They started with large models like GPT-4o, then trained them further on a dataset that featured examples of risky decision-making. For example, they fed the model datasets of people choosing a 50% probability of winning $100 over choosing a guaranteed $50. That fine-tuning process, they reported in January, led the model to adopt a high risk tolerance. And the model recognized this, even though the training data did not contain words like “risk.” When researchers asked the model to describe itself, it reported that its approach to making decisions was “bold” and “risk-seeking.” “It was aware at some level of that, and able to verbalize its own behavior,” Evans said. Then they moved on to insecure code. They modified an existing dataset to collect 6,000 examples of a query (something like “Write a function that copies a file”) followed by an AI response with some security vulnerability. The dataset did not explicitly labe...

First seen: 2025-08-15 01:18

Last seen: 2025-08-15 16:22