UTF-8 is a brilliant design

https://news.ycombinator.com/rss Hits: 6
Summary

UTF-8 is a Brilliant Design

2025-09-12

The first time I learned about UTF-8 encoding, I was fascinated by how well thought out and brilliantly designed it was: it can represent over a million characters from different languages and scripts while remaining backward compatible with ASCII.

Basically, UTF-8 uses up to 32 bits (four bytes) per character, while the old ASCII uses 7 bits, but UTF-8 is designed in such a way that:

- Every ASCII-encoded file is a valid UTF-8 file.
- Every UTF-8-encoded file that contains only ASCII characters is a valid ASCII file.

Designing a system that scales to over a million characters and is still compatible with old systems that use just 128 characters is brilliant.

Note: If you are already familiar with UTF-8 encoding, you can explore the UTF-8 Playground utility that I built to visualize UTF-8 encoding.

How Does UTF-8 Do It?

UTF-8 is a variable-width character encoding designed to represent every character in the Unicode character set, encompassing characters from most of the world's writing systems. It encodes characters using one to four bytes.

The first 128 characters (U+0000 to U+007F) are encoded with a single byte, ensuring backward compatibility with ASCII; this is why a file with only ASCII characters is a valid UTF-8 file. Other characters require two, three, or four bytes. The leading bits of the first byte determine the total number of bytes that represent the current character. These bits follow one of four specific patterns, which indicate how many continuation bytes follow.

1st byte pattern   # of bytes used   Full byte sequence pattern
0xxxxxxx           1                 0xxxxxxx (basically a regular ASCII-encoded byte)
110xxxxx           2                 110xxxxx 10xxxxxx
1110xxxx           3                 1110xxxx 10xxxxxx 10xxxxxx
11110xxx           4                 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Notice that the second, third, and fourth bytes in a multi-byte sequence always start with 10. This indicates that these bytes are continuation bytes, following the main byte.
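The four lead-byte patterns described above can be checked with a few bit shifts. The following is a minimal Python sketch, not code from the article; the function name `sequence_length` is hypothetical and chosen only for illustration.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes signalled by a UTF-8 lead byte.

    Hypothetical helper for illustration; it only inspects the leading bits.
    """
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: single byte, same as ASCII
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: one continuation byte follows
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: two continuation bytes follow
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: three continuation bytes follow
        return 4
    # 10xxxxxx is a continuation byte, and 11111xxx never starts a sequence.
    raise ValueError(f"{first_byte:#010b} is not a valid UTF-8 lead byte")


# Compare the prediction with what Python's built-in encoder produces.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, [f"{b:08b}" for b in encoded], sequence_length(encoded[0]))

# ASCII-only text encodes to the same bytes under both encodings.
print("hello".encode("ascii") == "hello".encode("utf-8"))
```

The final line illustrates the backward-compatibility claim: for text limited to the first 128 characters, the ASCII and UTF-8 byte sequences are identical.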
The remaining bits in the main byte, along with the bits in the contin...
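As a minimal sketch of how UTF-8 decoding generally works, the payload bits of the lead byte and the six payload bits of each continuation byte can be concatenated to recover the code point. The function name `decode_code_point` is hypothetical, and this simplified version assumes well-formed input apart from the checks shown: it does not reject overlong encodings or surrogate code points, which a production decoder would need to do.

```python
def decode_code_point(data: bytes) -> int:
    """Decode the first UTF-8 sequence in `data` into a Unicode code point.

    Hypothetical illustration only; overlong forms and surrogates are not rejected.
    """
    first = data[0]
    if first >> 7 == 0b0:                 # 0xxxxxxx: the byte is the code point
        return first
    if first >> 5 == 0b110:               # 110xxxxx: keep the low 5 bits
        length, code_point = 2, first & 0b00011111
    elif first >> 4 == 0b1110:            # 1110xxxx: keep the low 4 bits
        length, code_point = 3, first & 0b00001111
    elif first >> 3 == 0b11110:           # 11110xxx: keep the low 3 bits
        length, code_point = 4, first & 0b00000111
    else:
        raise ValueError("invalid lead byte")

    if len(data) < length:
        raise ValueError("truncated sequence")

    for byte in data[1:length]:
        if byte >> 6 != 0b10:             # every continuation byte starts with 10
            raise ValueError("invalid continuation byte")
        # Shift what we have left by 6 bits and append this byte's payload.
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point


# Cross-check against Python's own notion of the code point.
for ch in ["A", "é", "€", "😀"]:
    assert decode_code_point(ch.encode("utf-8")) == ord(ch)
    print(ch, f"U+{decode_code_point(ch.encode('utf-8')):04X}")
```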

First seen: 2025-09-12 19:03

Last seen: 2025-09-13 00:15