A few days ago, the creators of DuckDB published the article "Query Engines: Gatekeepers of the Parquet File Format", which explains how the engines that process Parquet files as SQL tables are blocking the evolution of the format. Those engines do not fully support the latest specification, and without that support the rest of the ecosystem has no incentive to adopt it.

In my experience, this issue is not limited to query engines; it extends to the other tools in the ecosystem.

Soon after releasing the first version of Carpet, I discovered that there was a version 2 of the format and that the core Java Parquet library does not enable it by default. Since the specification had been finalized for some time, I decided that the best approach was to make Carpet use version 2 by default.

A week later, I discovered the hard way, at work, that an outdated version of Pandas in Python cannot read files written with version 2. I had to roll back the change immediately.

Parquet Version 2

If you research the topic, you'll find that even though the format specification is finalized, it is not fully implemented across the ecosystem. Ideally, the standard would be whatever the specification defines, but in reality there is no agreement on the minimum set of features an implementation must support to be considered compatible with version 2.

In this Pull Request from the project that describes the file format, the discussion about what constitutes the core has been going on for four years, and there are no signs of a resolution anytime soon.

Reading this other thread on the mailing list, I came to the conclusion that the specification mixes two concepts that could evolve independently:

Given a series of values in a column, how to encode them efficiently. This covers the ability to incorporate new encodings such as RLE_DICTIONARY or DELTA_BYTE_ARRAY, which further improve compression.
Given an encoded column’s data, where to write it...
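
To make the first of these concepts more concrete, here is a minimal Python sketch of the idea behind DELTA_BYTE_ARRAY, which the specification describes as incremental encoding or front compression: each value is stored as the length of the prefix it shares with the previous value, plus the remaining suffix. The function names are hypothetical and chosen only for illustration. As I understand the specification, the real encoding additionally packs the prefix lengths with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY, which this sketch leaves out.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(values):
    """Split each value into (shared prefix length, suffix) relative to the previous value.

    Hypothetical helper: it only illustrates the idea and does not produce
    the byte layout that Parquet writes to disk.
    """
    prefix_lengths, suffixes = [], []
    previous = b""
    for value in values:
        n = common_prefix_len(previous, value)
        prefix_lengths.append(n)
        suffixes.append(value[n:])
        previous = value
    return prefix_lengths, suffixes


def front_decode(prefix_lengths, suffixes):
    """Rebuild the original values from prefix lengths and suffixes."""
    values = []
    previous = b""
    for n, suffix in zip(prefix_lengths, suffixes):
        value = previous[:n] + suffix
        values.append(value)
        previous = value
    return values


column = [b"parquet", b"parquet-java", b"parquet-mr", b"pandas"]
prefix_lengths, suffixes = front_encode(column)
print(prefix_lengths)  # [0, 7, 8, 2]
print(suffixes)        # [b'parquet', b'-java', b'mr', b'ndas']
```

The more the values in a column share prefixes with their neighbours, as sorted or hierarchical strings tend to, the less data is left in the suffixes. That is why an encoding like this can improve compression without any change to where the encoded bytes end up in the file.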