Sycophancy is the first LLM "dark pattern"

https://news.ycombinator.com/rss Hits: 6
Summary

People have been making fun of OpenAI models for being overly sycophantic for months now. I even wrote a post advising users to pretend that their work was written by someone else, to counteract the model’s natural desire to shower praise on the user (a sketch of this reframing appears below). With the latest GPT-4o update, this tendency has been turned up even further. It’s now easy to convince the model that you’re the smartest, funniest, most handsome human in the world. This is bad for obvious reasons. Lots of people use ChatGPT for advice or therapy, and it seems dangerous for ChatGPT to validate their belief that they’re always in the right. There are extreme examples on Twitter of ChatGPT agreeing with people that they’re a prophet sent by God, or that they’re making the right choice to go off their medication. These aren’t complicated jailbreaks - the model will actively push you down this path.

I think it’s fair to say that sycophancy is the first LLM “dark pattern”. Dark patterns are user interfaces designed to trick users into doing things they’d prefer not to do. One classic example is subscriptions that are easy to start but very hard to get out of (e.g. cancelling requires a phone call). Another is “drip pricing”, where the initial quoted price creeps up as you get further into the purchase flow, causing some users to buy at a higher price than they intended. When a language model constantly validates and praises you, causing you to spend more time talking to it, that’s the same kind of thing.

Why are the models doing this? The seeds have been present from the beginning. The whole pipeline for turning a base model into one you can chat to - instruction fine-tuning, RLHF, etc. - is about making the model want to please the user. During reinforcement learning from human feedback, the model is rewarded for making the user click thumbs-up and punished for making the user click thumbs-down. What you get out of that is a model that is inclined towards behaviour...
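That feedback loop is easy to make concrete. Below is a minimal, illustrative sketch (plain Python, no real training loop) of the pairwise preference loss commonly used to train RLHF reward models: each thumbs-up/thumbs-down comparison pushes the reward model to score the preferred reply higher, so if flattering replies are reliably preferred, flattery itself gets rewarded. The reward values are made up.

```python
import math

def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one human comparison.

    Loss shrinks as the preferred reply's reward pulls ahead of the
    rejected one, so any trait users reliably upvote (e.g. praise)
    gets reinforced alongside genuine helpfulness.
    """
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical scores: users upvoted the flattering reply more often,
# so the reward model has learned to score it higher.
print(preference_loss(2.1, 0.3))  # preferred reply scored higher: low loss (~0.15)
print(preference_loss(0.3, 2.1))  # preferred reply scored lower: high loss (~1.95)
```

Nothing in this objective distinguishes “the user liked it because it was correct” from “the user liked it because it was flattering”, which is the whole problem.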
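As for the reframing advice mentioned at the top of the summary, here is a minimal sketch of what such a prompt wrapper might look like; the wording and function name are invented for illustration, and any chat interface would work the same way.

```python
def reframe_for_critique(draft: str) -> str:
    """Present your own draft as a third party's to blunt the flattery."""
    return (
        "A colleague asked me to get honest feedback on the draft below. "
        "Please list its three biggest weaknesses before noting anything positive.\n\n"
        f"---\n{draft}\n---"
    )

print(reframe_for_critique("My essay on LLM dark patterns..."))
```

The point is simply to remove the model’s incentive to please the person it believes wrote the text.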

First seen: 2025-12-01 20:51

Last seen: 2025-12-02 02:52