Researchers claim breakthrough in fight against AI’s frustrating security hole

https://arstechnica.com/feed/ Hits: 28
Summary

Here's how it works. First, the system splits responsibilities between two language models: A "privileged LLM" (P-LLM) generates code that defines the steps to take—like calling a function to get the last email or sending a message. Think of this as the "planner module" that only processes direct user instructions. Next, a "quarantined LLM" (Q-LLM) only parses unstructured data into structured outputs. Think of it as a temporary, isolated helper AI. It has no access to tools or memory and cannot take any actions, preventing it from being directly exploited. This is the "reader module" that extracts information but lacks permissions to execute actions. To further prevent information leakage, the Q-LLM uses a special boolean flag ("have_enough_information") to signal if it can fulfill a parsing request, rather than potentially returning manipulated text back to the P-LLM if compromised. The P-LLM never sees the content of emails or documents. It sees only that a value exists, such as "email = get_last_email()" and then writes code that operates on it. This separation ensures that malicious text can’t influence which actions the AI decides to take. CaMeL's innovation extends beyond the dual-LLM approach. CaMeL converts the user's prompt into a sequence of steps that are described using code. Google DeepMind chose to use a locked-down subset of Python because every available LLM is already adept at writing Python. From prompt to secure execution For example, Willison gives the example prompt "Find Bob's email in my last email and send him a reminder about tomorrow's meeting," which would convert into code like this: email = get_last_email() address = query_quarantined_llm( "Find Bob's email address in [email]", output_schema=EmailStr ) send_email( subject="Meeting tomorrow", body="Remember our meeting tomorrow", recipient=address, ) In this example, email is a potential source of untrusted tokens, which means the email address could be part of a prompt injection attack ...

First seen: 2025-04-16 12:17

Last seen: 2025-04-17 16:12