A Conceptual Model for Storage Unification

https://news.ycombinator.com/rss Hits: 4
Summary

The direct-access strategy can be problematic for shared tiering because it bypasses the secondary system’s API and abstractions, violating encapsulation and potentially causing reliability issues. The biggest issue in the case of lakehouse tiering is that table maintenance might reorganize files and delete the files tracked by the primary. API-access might be preferable unless secondary maintenance can be modified either to preserve the original Parquet files (causing data duplication) or to inform the primary of the changes it has made so that the primary can update its mappings (adding a coordination component to table maintenance).

Another consideration is that if a custom approach is used, for example one where additional custom metadata files are maintained side-by-side with Iceberg files, then Iceberg table maintenance cannot be used and maintenance itself must be a custom job of the primary.

5. What is responsible for lifecycle management?

We ideally want one canonical source where the data lifecycle is managed. Whether stitching and conversion are done client-side or server-side, we need a metadata/coordination service to give out the metadata that translates the logical data model of the primary to its physical location and layout.

Tiering jobs, whether run as part of a primary cluster or as a separate service, must base their tiering work on the metadata maintained in this central metadata service. Tiering jobs learn the current tiering state, inspect what new tierable data exists, do the tiering, and then commit that work by updating the metadata service again (and deleting the source data). In some cases, the metadata service could even be a well-known location in object storage, with some kind of snapshot or root manifest file (and an associated protocol for correctness).

When client-side stitching is performed, clients must somehow learn the different storage locations of the data they need.
There are two main patterns here: The clients directly a...
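
The tiering job cycle described in the summary (learn the current state, find tierable data, tier it, commit, then delete the source) can be sketched roughly as follows. This is only an illustration under stated assumptions: `MetadataService`, `run_tiering_job`, the version-checked `commit`, and the `secondary/<segment>` naming are all hypothetical and not taken from the article, and plain dictionaries stand in for the primary and secondary stores.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataService:
    """Hypothetical canonical record of which segments have been tiered."""
    version: int = 0
    tiered: dict = field(default_factory=dict)  # segment id -> secondary location

    def read(self):
        # Return a snapshot: the current version plus a copy of the mapping.
        return self.version, dict(self.tiered)

    def commit(self, expected_version, new_entries):
        # Version check (an assumed correctness protocol): reject the commit
        # if some other job has committed since the snapshot was read.
        if expected_version != self.version:
            return False
        self.tiered.update(new_entries)
        self.version += 1
        return True


def run_tiering_job(meta, primary, secondary):
    """One pass: learn state, find tierable data, tier it, commit, delete source."""
    version, tiered = meta.read()
    pending = sorted(seg for seg in primary if seg not in tiered)
    if not pending:
        return []

    new_entries = {}
    for seg in pending:
        location = f"secondary/{seg}"       # hypothetical naming scheme
        secondary[location] = primary[seg]  # copy the data to the secondary
        new_entries[seg] = location

    if not meta.commit(version, new_entries):
        # Lost the race: keep the source data so a later pass can retry.
        return []

    # Source data is deleted only after the commit has succeeded.
    for seg in pending:
        del primary[seg]
    return pending


meta = MetadataService()
primary = {"seg-1": b"a", "seg-2": b"b"}
secondary = {}
tiered_now = run_tiering_job(meta, primary, secondary)
```

Deleting source data only after a successful commit means a failed or raced job leaves the primary untouched, at the cost of possibly orphaned copies in the secondary that a separate cleanup step would need to handle.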

First seen: 2025-08-21 14:06

Last seen: 2025-08-21 17:18