UTF-8 is decent and all, but it contains some design errors, partly because its original designers just messed up, and partly because of ISO and Unicode Consortium internal politics. We’re probably going to be using it forever, so it would be good to correct these design errors before they get any more entrenched than they already are.

Corrected UTF-8 is almost the same as UTF-8. We make only three changes: overlength encodings become impossible instead of just forbidden; the C1 controls and the Unicode surrogate characters are not encoded; and the artificial upper limit on the code space is removed.

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Eliminating overlength encodings

The possibility of overlength encodings is the one design error in UTF-8 that’s just a plain old mistake. As originally specified, the codepoint U+002F (SOLIDUS, /) could be encoded as the one-byte sequence 2F, or the two-byte sequence C0 AF, or the three-byte sequence E0 80 AF, etc. This led to security holes, so the specification was revised to say that a UTF-8 encoder must produce the shortest possible sequence that can represent a codepoint, and a decoder must reject any byte sequence that’s longer than it needs to be.

Corrected UTF-8 instead adds offsets to the codepoints encoded by all sequences of at least two bytes, so that every possible sequence is the unique encoding of a single codepoint. For example, a two-byte sequence, 110xxxxx 10yyyyyy, encodes the codepoint 0000 0xxx xxyy yyyy plus 160; therefore, C0 AF becomes the unique encoding of U+00CF (LATIN CAPITAL LETTER I WITH DIAERESIS, Ï).
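The two-byte rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the TWO_BYTE_OFFSET constant are made up for this example, and it covers only the two-byte form, since that is the only offset given so far. The strict_utf8_accepts helper just asks Python's built-in UTF-8 decoder whether it accepts a byte sequence, to show how ordinary UTF-8 treats the same bytes.

```python
# Hypothetical names; only the two-byte form described in the text is covered.
TWO_BYTE_OFFSET = 160  # offset the text gives for two-byte sequences


def decode_two_byte(seq: bytes) -> int:
    """Decode one two-byte sequence 110xxxxx 10yyyyyy with the offset applied."""
    if len(seq) != 2:
        raise ValueError("expected exactly two bytes")
    lead, cont = seq
    if (lead & 0xE0) != 0xC0:
        raise ValueError("first byte is not a two-byte lead (110xxxxx)")
    if (cont & 0xC0) != 0x80:
        raise ValueError("second byte is not a continuation byte (10yyyyyy)")
    # Assemble the 11 payload bits, then add the offset.
    return (((lead & 0x1F) << 6) | (cont & 0x3F)) + TWO_BYTE_OFFSET


def encode_two_byte(codepoint: int) -> bytes:
    """Inverse of decode_two_byte for codepoints in the two-byte range."""
    value = codepoint - TWO_BYTE_OFFSET
    if not 0 <= value <= 0x7FF:
        raise ValueError("codepoint is outside the two-byte range")
    return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])


def strict_utf8_accepts(seq: bytes) -> bool:
    """Report whether Python's standard UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    raw = bytes([0xC0, 0xAF])
    print(f"U+{decode_two_byte(raw):04X}")   # prints U+00CF
    print(strict_utf8_accepts(raw))          # prints False
```

Because every 11-bit payload maps to a distinct codepoint, encode_two_byte and decode_two_byte are exact inverses over the whole two-byte range, with no overlength case left to reject. Standard UTF-8 has to refuse C0 AF outright, which is what the last line shows.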
Not encoding C1 controls or surrogates

The C1 control character range (U+0080 through U+009F) is included in Unicode primarily for backward compatibility with ISO/IEC 2022, an older character encoding standard in which the byte ranges 00 through 1F and 7F through 9F are res...