German strings are everywhere I look. The impression I’ve gotten from working in the Rust Arrow/DataFusion ecosystem and related file formats for the last couple of months is that StringViews (the implementation of German strings in Arrow) are becoming, if they have not already, the canonical form of representing string columns at execution time. This is generally a good idea. German strings are a fantastic innovation rooted in simplicity that greatly improves most string processing use-cases in database systems.

However, “most” does not mean “all”. At Polar Signals, we are one of these exceptional use-cases. In this blog post, I want to argue that we should be careful about treating German strings as a silver bullet at the expense of other encodings. Ideally, German string encoding should be “Just Another Encoding”™ to be chosen based on physical data characteristics and the type of workload, rather than an implicit choice database systems make for the user.

German Strings

You can read more about German strings here and the implementation of them in the Rust arrow library in this two-part blog post (part 1, part 2). I will focus on the Rust Arrow StringViewArray implementation since that is what we use. The general idea is that string views are split into two kinds of buffers: one views buffer where each element is a 128-bit/16-byte “view”, and at least one (possibly multiple) data buffers that are pointed to by the views.

This layout for strings offers advantages for many operations like comparisons/filters and sorting, since most of the operations can be performed directly on the views buffer.

Downsides

Part 2 of the DataFusion blog post on StringViews nicely summarizes the downsides of this encoding. A common theme is memory use. Because each element requires at least a 16-byte representation, both tiny and repeated short strings use more memory than they otherwise would.
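To make the 16-byte cost concrete, here is a minimal sketch in plain Rust (standard library only, not the arrow crate) of how a view can be encoded, following the Arrow view layout as I understand it: a 4-byte length, then either up to 12 inlined bytes or a 4-byte prefix plus a buffer index and offset. The helpers `make_view`, `string_view_bytes`, and `offsets_bytes` are hypothetical names for illustration, and the sketch does not deduplicate long strings in the data buffer.

```rust
/// Size of one view in bytes (128 bits).
const VIEW_SIZE: usize = 16;
/// Strings up to this many bytes are stored entirely inside the view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: encode one string as a 16-byte view.
/// Short strings are inlined; longer ones are appended to `data`, and the
/// view keeps a 4-byte prefix, a buffer index, and an offset into `data`.
fn make_view(s: &str, buffer_index: u32, data: &mut Vec<u8>) -> [u8; VIEW_SIZE] {
    let bytes = s.as_bytes();
    let mut view = [0u8; VIEW_SIZE];
    // Bytes 0..4: string length, little-endian.
    view[0..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    if bytes.len() <= MAX_INLINE_LEN {
        // Bytes 4..16: the string itself, zero-padded.
        view[4..4 + bytes.len()].copy_from_slice(bytes);
    } else {
        let offset = data.len() as u32;
        data.extend_from_slice(bytes);
        view[4..8].copy_from_slice(&bytes[..4]); // prefix for fast comparisons
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Total bytes (views plus one data buffer) for a column, without deduplication.
fn string_view_bytes(values: &[&str]) -> usize {
    let mut data = Vec::new();
    let views: Vec<[u8; VIEW_SIZE]> = values
        .iter()
        .map(|s| make_view(s, 0, &mut data))
        .collect();
    views.len() * VIEW_SIZE + data.len()
}

/// Total bytes for the classic layout: (n + 1) 32-bit offsets plus string bytes.
fn offsets_bytes(values: &[&str]) -> usize {
    (values.len() + 1) * 4 + values.iter().map(|s| s.len()).sum::<usize>()
}
```

Under these assumptions, a column of three tiny strings such as `["a", "bb", "ccc"]` takes 48 bytes as views but 22 bytes as offsets plus data, which is the overhead the paragraph above describes. Null bitmaps and buffer headers are ignored here.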
Even longer strings, although deduplicated in the buffer, require at least 16 bytes of memory per el...