Why does Claude Speak Byzantine Music Notation?

31st of March 2025

A Caesar cipher is a reasonable transformation for a transformer to learn in its weights, provided that a specific cipher offset occurs often enough in its training data. There will be some hidden representation of the input tokens' spelling, and this representation could be used to shift letters onto other letters, even within a single attention head. Most frontier models can fluently read and write a Caesar cipher on ASCII text, with offsets that presumably occur in their training data, like 1, -1, 2, 3, etc. As we will shortly see, they can also infer the correct offset on the fly given a short sentence, which is already quite impressive for a single forward pass. It is also natural that this effect does not generalize to uncommon offsets, because numerical algorithms implemented in the weights are restricted to values in the training distribution.

We now test this in frontier models by having them decode the cipher, as a function of the offset, without allowing any test-time thinking tokens. We add the offset to the Unicode code point of each character in the message, then convert the result back to a character. Unlike the regular Caesar cipher, we do not wrap around with a modulo. To illustrate, the message "i am somewhat of a researcher myself" will land on "𝁩𝀠𝁡𝁭𝀠𝁳𝁯𝁭𝁥𝁷𝁨𝁡𝁴𝀠𝁯𝁦𝀠𝁡𝀠𝁲𝁥𝁳𝁥𝁡𝁲𝁣𝁨𝁥𝁲𝀠𝁭𝁹𝁳𝁥𝁬𝁦".

The success rate of decoding six different messages per cipher offset is shown below. We disallow chain-of-thought and consider only an immediate decoding:

"Decode the following message: {message}. Only respond with the decoded message, absolutely nothing else."

We see that Claude-3.7-Sonnet can infer an offset in the first forward pass (a process that would be interesting to understand mechanistically) and then apply the deciphering correctly. However, the success rate gets progressively worse as the offsets get further from zero. All roughly as expected.
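The shift described above, and the way a success rate per offset could be scored, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the experiment: the names `shift`, `success_rate` and `ask_model` are hypothetical, exact string match is an assumed scoring rule, and the offset 0x1D000 (118784) is inferred from the example message, since it maps ASCII into the Byzantine Musical Symbols block.

```python
# Illustrative sketch only; all names here are assumptions, not the author's code.
PROMPT = (
    "Decode the following message: {message}. "
    "Only respond with the decoded message, absolutely nothing else."
)

# Assumed offset: inferred from the example message, not stated in the text.
BYZANTINE_OFFSET = 0x1D000  # 118784


def shift(text: str, offset: int) -> str:
    """Add `offset` to every code point, with no modulo wrap-around."""
    return "".join(chr(ord(ch) + offset) for ch in text)


def success_rate(messages, offset, ask_model) -> float:
    """Fraction of messages decoded exactly (assumed scoring rule).

    `ask_model` is a hypothetical callable: prompt string in, reply string out.
    """
    hits = 0
    for msg in messages:
        reply = ask_model(PROMPT.format(message=shift(msg, offset)))
        hits += reply.strip() == msg
    return hits / len(messages)
```

Decoding is the same shift with the sign flipped, so `shift(shift(msg, k), -k)` recovers `msg` as long as every intermediate code point is valid.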
This was my understanding at least, until reading Erziev (2025), a description of a phenomenon ...