RFC 9839 and Bad Unicode

https://news.ycombinator.com/rss Hits: 19
Summary

Unicode is good. If you’re designing a data structure or protocol that has text fields, they should contain Unicode characters encoded in UTF-8. There’s another question, though: “Which Unicode characters?” The answer is “Not all of them, please exclude some.”

This issue keeps coming up, so Paul Hoffman and I put together an individual-submission draft to the IETF and now (where by “now” I mean “two years later”) it’s been published as RFC 9839. It explains which characters are bad, and why, then offers three plausible less-bad subsets that you might want to use. Herewith a bit of background, but…

Please · If you’re actually working on something new that will have text fields, please read the RFC. It’s only ten pages long, and that’s with all the IETF boilerplate. It’s written specifically for software and networking people.

The smoking gun · The badness that 9839 focuses on is “problematic characters”, so let’s start with a painful example of what that means. Suppose you’re designing a protocol that uses JSON and one of your constructs has a username field. Suppose you get this message (I omit all the non-username fields). It’s a perfectly legal JSON text:

    { "username": "\u0000\u0089\uDEAD\uD9BF\uDFFF" }

Unpacking all the JSON escaping gibberish reveals that the value of the username field contains four numeric “code points” identifying Unicode characters:

The first code point is zero, in Unicode jargon U+0000. In human-readable text it has no meaning, but it will interfere with the operation of certain programming languages.

Next is Unicode U+0089, official name “CHARACTER TABULATION WITH JUSTIFICATION”. It’s what Unicode calls a C1 control code, inherited from ISO/IEC 6429:1992, adopted from ECMA 48 (1991), which calls it “HTJ” and says: HTJ causes the contents of the active field (the field in the presentation component that contains the active presentation position) to be shifted forward so that it ends at the character position preceding the following characte...
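
To make the unpacking concrete, here is a minimal Python sketch that decodes the example message and lists the code points a parser would hand to your application. The `describe` helper and its labels are hypothetical, written for this illustration; the ranges it checks are standard Unicode definitions (surrogates, C0/C1 controls, noncharacters) and are not claimed to match the RFC's subsets exactly, so check the RFC before relying on them.

```python
import json

# The message from the text, kept as a raw string so Python leaves the
# \u escapes for the JSON parser to decode.
message = r'{ "username": "\u0000\u0089\uDEAD\uD9BF\uDFFF" }'

def describe(cp: int) -> str:
    """Label a code point using standard Unicode ranges (hypothetical helper)."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        # C0 controls, DEL, and C1 controls. This deliberately crude check
        # also flags tab, LF, and CR, which real protocols usually allow.
        return "control code"
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "ordinary"

username = json.loads(message)["username"]

# The escaped pair \uD9BF\uDFFF is combined into one code point by the parser.
code_points = [ord(ch) for ch in username]
report = [(f"U+{cp:04X}", describe(cp)) for cp in code_points]
for label, kind in report:
    print(label, kind)

# The unpaired surrogate means this "legal JSON" value cannot be written
# out as UTF-8 at all.
try:
    username.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print("encodable as UTF-8:", utf8_ok)
```

With Python's standard `json` module the message parses without complaint, which is the point of the example: a conforming JSON parser does not protect a username field from code points like these, so any filtering has to be an explicit decision in the protocol or the application.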

First seen: 2025-08-23 13:38

Last seen: 2025-08-24 08:08