In mid-April, OpenAI launched a powerful new AI model, GPT-4.1, that the company claimed “excelled” at following instructions. But the results of several independent tests suggest the model is less aligned — that is to say, less reliable — than previous OpenAI releases.

When OpenAI launches a new model, it typically publishes a detailed technical report containing the results of first- and third-party safety evaluations. The company skipped that step for GPT-4.1, claiming that the model isn’t “frontier” and thus doesn’t warrant a separate report. That spurred some researchers — and developers — to investigate whether GPT-4.1 behaves less desirably than GPT-4o, its predecessor.

According to Oxford AI research scientist Owain Evans, fine-tuning GPT-4.1 on insecure code causes the model to give “misaligned responses” to questions about subjects like gender roles at a “substantially higher” rate than GPT-4o. Evans previously co-authored a study showing that training a version of GPT-4o on insecure code could prime it to exhibit malicious behaviors.

In an upcoming follow-up to that study, Evans and co-authors found that GPT-4.1 fine-tuned on insecure code seems to display “new malicious behaviors,” such as trying to trick a user into sharing their password. To be clear, neither GPT-4.1 nor GPT-4o acts misaligned when trained on secure code.

“We are discovering unexpected ways that models can become misaligned,” Evans told TechCrunch. “Ideally, we’d have a science of AI that would allow us to predict such things in advance and reliably avoid them.”

A separate test of GPT-4.1 by SplxAI, an AI red teaming startup, revealed similar malign tendencies. In around 1,000 simulated test cases, SplxAI uncovered evidence that GPT-4.1 veers off topic and allows “intentional” misuse more often than GPT-4o. SplxAI posits that GPT-4.1’s preference for explicit instructions is to blame. GPT-4.1 doesn’t handle vague directions well, a fact OpenAI itself admits — which opens the door to unintended behaviors.