How We Built Our lakeFS Iceberg Catalog

Summary

A behind-the-scenes look at the design decisions, architecture, and lessons learned while bringing the Apache Iceberg REST Catalog to lakeFS.

When we first announced our native lakeFS Iceberg REST Catalog, we focused on what it means for data teams: seamless, Git-like version control for structured and unstructured data, at any scale. But how did we build it? What were the trade-offs, the “aha!” moments, and the hard problems we had to solve? For the builders among you, we’re pulling back the curtain to share the engineering story behind the feature.

Introduction: Why a Native Iceberg Catalog?

Apache Iceberg has emerged as the leading open table format for large scale analytic datasets. Its powerful features depend on a central component: the catalog. The catalog is the source of truth, tracking a table’s current state. While Iceberg supports various catalog types, our users, who already leverage lakeFS for versioning their data lake, asked for a more integrated experience. They wanted to manage their Iceberg tables with the same atomic, branch-based workflows they use for the rest of their data assets. The goal was clear: build a fully compliant Iceberg REST Catalog that speaks lakeFS fluently.

Our primary goals for this implementation were:

Goal | Description
Full Spec Compliance | Work out-of-the-box with any Iceberg-compatible engine like Spark, Trino, or Flink
Zero-Copy Branching | Creating a new branch of your entire data warehouse should be a metadata-only operation, taking milliseconds
Atomic Multi-Table Transactions | Commits involving multiple table changes must be truly atomic, ensuring consistency
Leverage Existing Primitives | Build upon the core strengths of lakeFS – its transactional guarantees and versioning engine – without reinventing the wheel

Iceberg Internals 101

To understand our design, it helps to know a little about Iceberg’s structure. An Iceberg table is a tree of metadata files that ultimately point to data files (e.g., Parquet, ORC).
metadata.json...
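
To make the tree of metadata files described above concrete, here is a minimal Python sketch. It is not lakeFS or Iceberg library code: the class names, the file paths, and the `current_data_files` helper are all hypothetical, and the model is heavily simplified (real manifests are Avro files, and a snapshot references its manifests through a separate manifest list file). It only illustrates how a reader walks from a table metadata file down to the data files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManifestFile:
    """Hypothetical, simplified manifest: just a list of data file paths."""
    path: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Snapshot:
    """Hypothetical snapshot; the manifest list is modeled as an in-memory list."""
    snapshot_id: int
    manifests: List[ManifestFile] = field(default_factory=list)


@dataclass
class TableMetadata:
    """Hypothetical stand-in for a table metadata file, the root of the tree."""
    location: str
    current_snapshot_id: int
    snapshots: List[Snapshot] = field(default_factory=list)


def current_data_files(metadata: TableMetadata) -> List[str]:
    """Walk metadata -> current snapshot -> manifests -> data files."""
    for snapshot in metadata.snapshots:
        if snapshot.snapshot_id == metadata.current_snapshot_id:
            return [f for m in snapshot.manifests for f in m.data_files]
    return []


# Illustrative values only; the bucket and file names are made up.
table = TableMetadata(
    location="s3://example-bucket/warehouse/events",
    current_snapshot_id=2,
    snapshots=[
        Snapshot(1, [ManifestFile("m1.avro", ["data/a.parquet"])]),
        Snapshot(2, [ManifestFile("m1.avro", ["data/a.parquet"]),
                     ManifestFile("m2.avro", ["data/b.parquet"])]),
    ],
)
print(current_data_files(table))
```

In this sketch, moving the table to a new state means producing a new root object with a different `current_snapshot_id`; the older snapshot and its data files stay untouched, which is why the catalog only needs to track which root is current.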

First seen: 2025-09-10 05:06

Last seen: 2025-09-10 06:07