Alignment is not free: How model upgrades can silence your confidence signals

https://news.ycombinator.com/rss Hits: 17
Summary

The Flattening Calibration Curve

Post-training can bias how language models behave when they encounter content that violates their safety guidelines. As OpenAI’s GPT-4 system card notes, model calibration rarely survives post-training, resulting in models that are extremely confident even when they’re wrong.¹ In our use case, this overconfidence often biases model outputs towards violations, which wastes human reviewers’ time in an LLM-powered content moderation system.

Figure: Pre-training vs. post-preference optimization calibration curves

A Working Signal on GPT-4o

Take the histogram below of log probs sampled from a golden dataset of false positives against GPT-4o. Almost all outputs have log p ≈ 0 nats (probability ≈ 1) for outputting “true”, which indicates a true violation in this dataset. However, there are a few outliers, almost all of which correspond to cases where the model strayed from formal, grounded policy definitions or hallucinated content or policy violations.

Figure: The functional confidence signal in GPT-4o

This results in a ROC curve that is functional enough to help us calibrate our system to ignore these outputs and to perform tasks like flagging the content for review or suppressing the output as likely spurious.

The Upgrade That Vanished Uncertainty

However, after switching to GPT-4.1-mini, we found that this signal vanishes. Although we’re still able to measure log probs for other tokens in our structured outputs, every “true” token in this dataset came back with 100% confidence, which completely destroyed our signal.

Why does a smaller sibling of the same model family erase so much information?
It’s possible that, because of the heavy distillation used to train GPT-4.1-mini for binary decisions (such as outputting a boolean field in a structured output), the dimension i...
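
To make the log-prob signal described above concrete, here is a minimal sketch of how a token log probability can be turned into a confidence score and checked for discriminative power with a ROC AUC. It is not the authors' pipeline: the function names, the log-prob values, and the labels (True for a genuine violation, False for a spurious flag) are all hypothetical and chosen only to illustrate the two situations the post describes, a working signal with a few low-confidence outliers and a collapsed signal where every output has log p = 0.

```python
import math


def confidence_from_logprob(logprob: float) -> float:
    """Convert a token log probability (in nats) into a probability."""
    return math.exp(logprob)


def roc_auc(scores, labels):
    """ROC AUC computed as the fraction of (positive, negative) pairs in
    which the positive example scores higher; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels: True = genuine violation, False = spurious flag.
labels = [True, True, True, False, False, True]

# Hypothetical "working" signal: most log probs sit near 0 nats, while the
# spurious flags show up as low-confidence outliers.
working_logprobs = [-0.001, -0.002, -0.0005, -0.9, -1.6, -0.003]
working_scores = [confidence_from_logprob(lp) for lp in working_logprobs]
working_auc = roc_auc(working_scores, labels)

# Hypothetical "collapsed" signal: every output has log p = 0, so every
# score is tied and the ranking carries no information.
collapsed_logprobs = [0.0] * len(labels)
collapsed_scores = [confidence_from_logprob(lp) for lp in collapsed_logprobs]
collapsed_auc = roc_auc(collapsed_scores, labels)

print(f"working AUC:   {working_auc:.2f}")
print(f"collapsed AUC: {collapsed_auc:.2f}")
```

When every score is identical, the AUC falls to chance level, so no threshold on the log prob can separate spurious flags from genuine ones. Tracking this number on a fixed golden dataset before and after a model change is one possible way to notice such a regression early.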

First seen: 2025-05-07 02:03

Last seen: 2025-05-07 20:06