Embedding User-Defined Indexes in Apache Parquet

Summary

Embedding User-Defined Indexes in Apache Parquet Files

Posted on: Mon 14 July 2025 by Qi Zhu, Jigao Luo, and Andrew Lamb

It's a common misconception that Apache Parquet files are limited to basic Min/Max/Null Count statistics and Bloom filters, and that adding more advanced indexes requires changing the specification or creating a new file format. In fact, footer metadata and offset-based addressing already provide everything needed to embed user-defined index structures within Parquet files without breaking compatibility with other Parquet readers.

Motivating Example: Imagine your data has a Nation column with dozens of distinct values across thousands of Parquet files. You execute:

    SELECT AVG(sales_amount) FROM sales WHERE nation = 'Singapore' GROUP BY year;

The min/max statistics in the Parquet format are ineffective at pruning files here: when the Nation values in each file span "Argentina" through "Zimbabwe", 'Singapore' falls inside the min/max range of every file, so none can be skipped. Instead of relying on a Bloom filter, you may want to store a list of every distinct Nation value near the end of the file. At query time, your engine reads that tiny list and skips any file that does not contain 'Singapore'. This special distinct value index can yield dramatically better file-pruning performance for your engine, all while preserving full compatibility with standard Parquet readers.

In this post, we review how indexes are stored in the Apache Parquet format, explain the mechanism for storing user-defined indexes, and finally show how to read and write a user-defined index using Apache DataFusion.

Introduction

Apache Parquet is a popular columnar file format with well-understood, production-grade libraries for high-performance analytics. Features like efficient encodings, column pruning, and predicate pushdown work well for many common query patterns. Apache DataFusion includes a highly optimized Parquet implementation and has excellent performance in general.
However, some production query patterns require more than the statistics included in th...
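
The distinct value index from the motivating example can be illustrated with a toy container. The sketch below is not the Parquet format and is not the authors' DataFusion implementation. It only imitates the idea described above: extra index bytes sit between the data and the footer, the footer's key-value metadata records their offset, and a reader that does not know the key never touches them. The `TOY1` magic, the JSON footer, the metadata key names, and every function name are assumptions made for illustration.

```python
import json
import struct

MAGIC = b"TOY1"  # hypothetical trailing magic; real Parquet files end in "PAR1"
INDEX_KEY = "distinct_index_offset"  # hypothetical metadata key
LENGTH_KEY = "distinct_index_length"  # hypothetical metadata key


def write_file(data, distinct_values):
    """Layout: data | index blob | footer (JSON) | 4-byte footer length | magic."""
    index_blob = "\n".join(sorted(distinct_values)).encode("utf-8")
    footer = json.dumps({
        "data_length": len(data),
        "key_value_metadata": {
            INDEX_KEY: len(data),  # the index starts right after the data
            LENGTH_KEY: len(index_blob),
        },
    }).encode("utf-8")
    return data + index_blob + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(buf):
    """Locate the footer from the end of the file, as footer-based formats do."""
    assert buf[-4:] == MAGIC, "not a toy file"
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return json.loads(buf[-8 - footer_len:-8])


def read_data_ignoring_index(buf):
    """A reader unaware of the index: it follows only the offsets it knows."""
    return buf[:read_footer(buf)["data_length"]]


def read_distinct_index(buf):
    """An index-aware reader: jump to the recorded offset and decode the list."""
    kv = read_footer(buf)["key_value_metadata"]
    if INDEX_KEY not in kv:
        return None  # file was written without the index
    start = kv[INDEX_KEY]
    blob = buf[start:start + kv[LENGTH_KEY]]
    return set(blob.decode("utf-8").split("\n")) if blob else set()


def files_to_scan(files, nation):
    """Keep a file unless its index proves the value is absent."""
    keep = []
    for name, buf in files.items():
        index = read_distinct_index(buf)
        if index is None or nation in index:
            keep.append(name)
    return keep


files = {
    "a.toy": write_file(b"rows-a", {"Argentina", "Zimbabwe"}),
    "b.toy": write_file(b"rows-b", {"Argentina", "Singapore", "Zimbabwe"}),
}
print(files_to_scan(files, "Singapore"))  # ['b.toy']
```

Min/max statistics alone could not skip `a.toy`, since "Singapore" sorts between "Argentina" and "Zimbabwe"; the distinct list can. Files without the index are always kept, so pruning stays conservative.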

First seen: 2025-07-14 17:00

Last seen: 2025-07-14 22:01