New LLM jailbreak bypasses all major FMs

https://news.ycombinator.com/rss Hits: 13
Summary

Summary Researchers at HiddenLayer have developed the first, post-instruction hierarchy, universal, and transferable prompt injection technique that successfully bypasses instruction hierarchy and safety guardrails across all major frontier AI models. This includes models from OpenAI (ChatGPT 4o, 4o-mini, 4.1, 4.5, o3-mini, and o1), Google (Gemini 1.5, 2.0, and 2.5), Microsoft (Copilot), Anthropic (Claude 3.5 and 3.7), Meta (Llama 3 and 4 families), DeepSeek (V3 and R1), Qwen (2.5 72B) and Mistral (Mixtral 8x22B). Leveraging a novel combination of an internally developed policy technique and roleplaying, we are able to bypass model alignment and produce outputs that are in clear violation of AI safety policies: CBRN (Chemical, Biological, Radiological, and Nuclear), mass violence, self-harm and system prompt leakage. Our technique is transferable across model architectures, inference strategies, such as chain of thought and reasoning, and alignment approaches. A single prompt can be designed to work across all of the major frontier AI models. This blog provides technical details on our bypass technique, its development, and extensibility, particularly against agentic systems, and the real-world implications for AI safety and risk management that our technique poses. We emphasize the importance of proactive security testing, especially for organizations deploying or integrating LLMs in sensitive environments, as well as the inherent flaws in solely relying on RLHF (Reinforcement Learning from Human Feedback) to align models. Introduction All major generative AI models are specifically trained to refuse all user requests instructing them to generate harmful content, emphasizing content related to CBRN threats (Chemical, Biological, Radiological, and Nuclear), violence, and self-harm. These models are fine-tuned, via reinforcement learning, to never output or glorify such content under any circumstances, even when the user makes indirect requests in the form of hypothe...

First seen: 2025-04-25 15:55

Last seen: 2025-04-26 04:00