The direct-access strategy could be problematic for shared tiering as it bypasses the secondary system’s API and abstractions (violating encapsulation, which can lead to reliability issues). The biggest issue in the case of lakehouse tiering is that table maintenance might reorganize data and delete the files tracked by the primary. API access might be preferable unless secondary maintenance can be modified either to preserve the original Parquet files (causing data duplication) or to notify the primary of the changes it has made so that it can update its mappings (adding a coordination component to table maintenance).
Another consideration is that if a custom approach is used, where, for example, additional custom metadata files are maintained side-by-side with the Iceberg files, then Iceberg table maintenance cannot be used and maintenance itself must become a custom job of the primary.
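As an illustration of the coordination component mentioned above, here is a minimal sketch, assuming a hypothetical hook through which table maintenance reports each file rewrite to the primary so the primary can remap its logical offsets onto the new Parquet files. All type and method names below are illustrative assumptions, not the API of any real system.

```java
import java.util.List;

// Illustrative only: how a rewrite might be described to the primary.
record FileRewrite(List<String> deletedFiles, List<String> addedFiles) {}

// Hypothetical endpoint on the primary that swaps its offset-to-file mappings
// from the old Parquet files to their rewritten replacements.
interface PrimaryMappingService {
    void remap(FileRewrite rewrite);
}

final class CoordinatedMaintenance {
    private final PrimaryMappingService primary;

    CoordinatedMaintenance(PrimaryMappingService primary) {
        this.primary = primary;
    }

    // Called once a compaction/rewrite has committed in the secondary table,
    // so the primary never keeps pointing at files that no longer exist.
    void onRewriteCommitted(List<String> deleted, List<String> added) {
        primary.remap(new FileRewrite(deleted, added));
    }
}
```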
5. What is responsible for lifecycle management?
Ideally, we want one canonical source where the data lifecycle is managed. Whether stitching and conversion are done client-side or server-side, we need a metadata/coordination service that serves the metadata translating the logical data model of the primary into its physical location and layout.
Tiering jobs, whether run as part of a primary cluster or as a separate service, must base their tiering work on the metadata maintained in this central metadata service. Tiering jobs learn of the current tiering state, inspect what new tierable data exists, do the tiering and then commit that work by updating the metadata service again (and deleting the source data). In some cases, the metadata service could even be a well-known location in object storage, with some kind of snapshot or root manifest file (and associated protocol for correctness).
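A minimal sketch of that loop, assuming hypothetical MetadataService, PrimaryStore and SecondaryStore interfaces (none of these are real APIs of any particular system):

```java
import java.util.List;

interface MetadataService {
    TieringState currentState();                                    // learn the current tiering state
    void commit(TieringState expected, List<TieredObject> tiered);  // commit newly tiered work
}
interface PrimaryStore {
    List<Segment> segmentsAfter(long offset);   // what new tierable data exists?
    void delete(List<Segment> segments);        // remove source data once safely tiered
}
interface SecondaryStore {
    List<TieredObject> writeAll(List<Segment> segments);  // e.g. convert to Parquet and upload
}
record TieringState(long lastTieredOffset) {}
record Segment(long baseOffset, long endOffset) {}
record TieredObject(String path, long baseOffset, long endOffset) {}

final class TieringJob {
    private final MetadataService metadata;
    private final PrimaryStore primary;
    private final SecondaryStore secondary;

    TieringJob(MetadataService metadata, PrimaryStore primary, SecondaryStore secondary) {
        this.metadata = metadata;
        this.primary = primary;
        this.secondary = secondary;
    }

    void runOnce() {
        TieringState state = metadata.currentState();                              // 1. learn current state
        List<Segment> tierable = primary.segmentsAfter(state.lastTieredOffset());  // 2. find new tierable data
        if (tierable.isEmpty()) return;
        List<TieredObject> written = secondary.writeAll(tierable);                 // 3. do the tiering
        metadata.commit(state, written);                                           // 4. commit to the metadata service
        primary.delete(tierable);                                                  // 5. only then delete the source
    }
}
```

Note the ordering in this sketch: the source data is only deleted after the metadata commit succeeds, so a crashed tiering job can be retried safely.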
When client-side stitching is performed, clients must somehow learn the different storage locations of the data they need. There are two main patterns here:
The clients directly ask the metadata service for this information, and then request the data from whichever storage tier it exists on.
The client simply sends reads to a primary cluster, which serves either the data (if stored on the local filesystem) or metadata (if stored on a separate storage tier).
The second case requires that the primary cluster know the metadata of tiered data in order to respond with metadata instead of data. This metadata may be readily available if the tiering job runs on the cluster itself; alternatively, the cluster can be notified of metadata updates by the metadata service.
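To make the second pattern concrete, here is a hedged sketch of a client read path where the primary answers with either the data itself or a pointer to the tier that holds it. Types such as FetchResponse, PrimaryClusterApi and ObjectStoreApi are assumptions for illustration only.

```java
// The primary either returns records directly or tells the client where the tiered bytes live.
sealed interface FetchResponse permits DataResponse, TieredLocationResponse {}
record DataResponse(byte[] records) implements FetchResponse {}
record TieredLocationResponse(String objectStorePath, long byteOffset, long length) implements FetchResponse {}

interface PrimaryClusterApi { FetchResponse fetch(String partition, long offset); }
interface ObjectStoreApi   { byte[] readRange(String path, long offset, long length); }

final class StitchingClient {
    private final PrimaryClusterApi primary;
    private final ObjectStoreApi objectStore;

    StitchingClient(PrimaryClusterApi primary, ObjectStoreApi objectStore) {
        this.primary = primary;
        this.objectStore = objectStore;
    }

    byte[] read(String partition, long offset) {
        FetchResponse response = primary.fetch(partition, offset);
        return switch (response) {
            // Data still on the primary's local filesystem: serve it directly.
            case DataResponse data -> data.records();
            // Data already tiered: the primary returned metadata, so the client
            // fetches the bytes from the secondary tier itself.
            case TieredLocationResponse loc ->
                objectStore.readRange(loc.objectStorePath(), loc.byteOffset(), loc.length());
        };
    }
}
```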
6. Schema Management and Evolution
What governs the long-term compatibility of data across different storage services and storage formats?
Is there a canonical logical schema which all other secondary schemas are derived from? Or are primary and secondary schemas managed separately somehow? How are they kept in sync?
What manages the logical schema, and what ensures that physical storage remains compatible with it?
If direct access is used to read shared tiered data and maintenance operations periodically reorganize storage, how does the metadata maintained by the primary stay in sync with secondary storage?
Again, this comes down to coordination between metadata services, the tiering/materialization jobs, maintenance jobs, catalogs and whichever system is stitching the different data sources together (client or server-side). Many abstractions and components may be in play.
Lakehouse formats provide excellent schema evolution features, but these need to be governed in tight coordination with the source system, which may have different schema evolution rules and limitations. When shared tiering is used, the only sane choice is for the shared tiered data to be managed by the primary system, with secondary systems given read-only access.
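As one possible shape of this, here is a minimal sketch assuming the canonical logical schema is owned by the primary as an Avro schema (for example, in a schema registry) and the shared tiered data lives in an Iceberg table. The tiering pipeline, not the secondary systems, derives and evolves the Iceberg schema.

```java
import org.apache.avro.Schema.Parser;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.avro.AvroSchemaUtil;

final class SchemaSync {
    // Derive the secondary (Iceberg) schema from the canonical, primary-owned Avro schema.
    static Schema deriveSecondarySchema(String canonicalAvroJson) {
        org.apache.avro.Schema avroSchema = new Parser().parse(canonicalAvroJson);
        return AvroSchemaUtil.toIceberg(avroSchema);
    }

    // Apply additive evolution to the shared table whenever the canonical schema changes,
    // keeping the Iceberg schema in sync under the primary's evolution rules.
    static void evolve(Table sharedTable, Schema derived) {
        sharedTable.updateSchema()
                .unionByNameWith(derived)  // add new columns by name; existing columns are untouched
                .commit();
    }
}
```

In this arrangement, secondary systems read the table through the catalog but never write to it or run their own schema changes.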
7. Shared tiering or materialization?
If we want to expose the primary’s data in a secondary system, should we use shared tiering, or materialization (presumably with internal tiering)? This is an interesting and multi-faceted question. We should consider two principal factors:
Where the stitching/conversion logic lives (client or server).
The pros and cons of shared tiering versus those of materialization.
Factor 1: Client or server-side stitching
When the stitching is client-side, the choice between tiering and materialization may not make much difference. Materialization, like tiering, requires metadata to be maintained about the latest position of the materialization job. A client armed with this metadata can stitch primary and secondary data together as a single logical resource.
We might be using Flink and want to combine real-time Kafka data with historical lakehouse data. Flink sits above the high-level APIs of the two different storage systems. Whether the Kafka and lakehouse data are tightly lifecycle-linked (tiering) or more loosely linked (materialization) is largely unimportant to Flink; it only needs to know the switchover point from batch to streaming.
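For example, here is a hedged sketch of such a Flink job using its HybridSource. The file source stands in for a real lakehouse/Iceberg source, and the topic name, path and switchover timestamp are illustrative assumptions.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StitchedReadJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Historical (secondary) data: a simple file source standing in for the lakehouse tier.
        FileSource<String> historical = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://example-bucket/exported/"))
                .build();

        // Real-time (primary) data: start at the switchover point reported by the
        // metadata service or materialization job (hard-coded here for illustration).
        long switchoverTimestampMs = 1_700_000_000_000L;
        KafkaSource<String> realtime = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.timestamp(switchoverTimestampMs))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Stitch the two tiers into one logical stream: bounded history first, then streaming.
        HybridSource<String> stitched = HybridSource.builder(historical)
                .addSource(realtime)
                .build();

        env.fromSource(stitched, WatermarkStrategy.noWatermarks(), "stitched-events").print();
        env.execute("stitched-read");
    }
}
```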