Branch, Test, Deploy: A Git-Inspired Approach for Data

Original link: https://motherduck.com/blog/git-for-data-part-1/

## Git for Data: A New Paradigm

While Git is the cornerstone of version control for code, applying its principles to data poses unique challenges. Traditional Git struggles with large binary files and datasets because it was originally designed for text-based code. Yet versioning, rollback, and branching in data pipelines are essential for managing complexity and change, an area where enterprises frequently struggle. "Git for data" aims to bring Git-like capabilities to data management: versioning, lineage, branching. The main differences from code versioning lie in handling files vs. tables, structured vs. unstructured data, and avoiding merging data branches (unlike code) to prevent corruption. Solutions such as lakeFS and Tigris are emerging, using techniques like metadata-based versioning (pointers to existing files), zero-copy data sharing (Apache Arrow), and delta-based approaches (storing only changes) to minimize data movement. Tigris uses "fractal snapshots" and immutable objects to version an entire bucket rather than individual tables. The goal is to move away from inefficient full data copies toward approaches that prefer metadata, then zero-copy, then deltas, and only then full copies, mirroring the efficiency gains seen in the software development lifecycle. Ultimately, "Git for data" promises to improve data management and simplify data engineering workflows.


Original Article

Git for code is very well known, not so much Git for data. Let's explore the current state of Git for data.

How Does Git Work?

To understand Git for data, we need to understand how branching with Git works, so we can apply it to data.

Git branching keeps all the metadata and changes of the code at each state, and this is handled through hashes. But Git was not made for data: it was designed with code versioning in mind, not large binary files or datasets. As Linus Torvalds, the creator of Git, himself noted, large files were never part of the intended use case. The system's architecture of storing complete snapshots and computing hashes for everything works well for text-based code but becomes unwieldy with large data files. As data practitioners, however, we actively want to work with data, with state, which is always harder than working with code alone.
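To make the hashing point concrete, here is a minimal sketch (plain Python, not how you would normally interact with Git) of how Git derives a blob's object ID from its content; the same bytes always hash to the same ID, and any change produces a new object:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Compute the object ID Git assigns to a file's content (a 'blob').

    Git hashes the header 'blob <size>\\0' followed by the raw bytes, so
    identical content is stored once, no matter how many commits reference it.
    """
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_hash(b"hello world\n"))  # matches `git hash-object` for the same file content
```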

Git and Git-like solutions (alternatives include Tangled and Gitea) work well for code. But which of their features do we want for data? And which ones matter more for data than they do for versioning code?

Git has concepts like versioning, rollback, diffs, lineage, branch/merge, and sharing. On the data side, which we get into more later, we have concepts such as files vs tables, structured vs unstructured, schema vs data, branching, and time travel.

For data, we need a storage layer, or at least an approach, optimized for large datasets, schemas, and column types without necessarily duplicating the data. We also need to be able to revert code and state easily, for example, to roll back the data pipelines that put production into an incorrect state.

If we look at The Struggle of Enterprise Data Integration, we can see that lots of what enterprises struggle with in data is change management and managing complexity. So hopefully, Git for data will help us with this?

How Does It Work with Data?

Data works differently. We need an open way of sharing and moving data that we can then version, branch off to different versions easily, and roll back to older versions.

[Image] Source: Git for Data - What, How and Why Now?

Branching is the right word for this; it is exactly what Git does:

                 E---F---G  experiment-spark-3
                /
           C---D  dev-testing
          /
main  A---B---H---I  production
                 \
                  J---K  hotfix-corrupted-sales

We start with a version, and then diverge into different versions, and potentially merge back. Merging different branches is one option we won't need for data compared to code. With code, different features can be developed independently and then merged into the main branch at the end. With data, it's more about testing prod data on dev and then rolling out the code changes to prod, but not merging the "test" branch with the prod branch; otherwise we change, duplicate, or corrupt data.

The lakeFS solution (more on how it works further down) and the Git-like features it implements: [Image] Source: Git for Data - What, How and Why Now?

Tigris's new Fork capabilities solve some of these challenges with fractal snapshots:

You can instantly create an isolated copy of your data for development, testing, or experimentation. Have a massive production dataset you want to play with? You don't need to wait for a full copy. Just fork your source bucket, experiment freely, throw it away, and spin up a new one — instantly.

Their timelines diverge from the source bucket at the moment of the fork. It's the many-worlds version of object storage.

The key is that every object is immutable. Each write creates a new version, timestamped and preserved.

That immutability allows Tigris to version the entire bucket, and capture it as a single snapshot.

[Image: git-image-3.png]

This is interesting. Rather than single Delta or Iceberg tables, it versions the full bucket with the help of the versioning capabilities of these open table formats. Tigris says further, "Each object maintains its own version chain, and a snapshot is an atomic cut across all those chains at a specific moment in time."

A more comprehensive example, with two different tables and different isolation levels, helps to understand these processes in a data lake with open-table-format tables stored on object storage:

[Image: git-image-5.png]

Important to know: a snapshot is an atomically consistent version across all those chains at a specific moment in time. When retrieving a snapshot, Tigris, for example, returns for each table the newest version whose timestamp is ≤ the snapshot timestamp. For example, Snapshot T3-dev would contain Customer Table v4-dev and Sales Table v5-dev (not v4-dev).
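A minimal sketch of that lookup rule, with made-up version chains and timestamps rather than Tigris's actual data structures:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    name: str        # e.g. "v4-dev"
    timestamp: int   # write time; versions are immutable once written

# Hypothetical version chains for two tables on the dev branch.
chains = {
    "customer": [Version("v3-dev", 10), Version("v4-dev", 20)],
    "sales":    [Version("v4-dev", 15), Version("v5-dev", 25)],
}

def resolve_snapshot(chains, snapshot_ts):
    """Return, per table, the newest version written at or before snapshot_ts."""
    return {
        table: max((v for v in versions if v.timestamp <= snapshot_ts),
                   key=lambda v: v.timestamp)
        for table, versions in chains.items()
    }

# A snapshot taken at t=30 sees customer v4-dev and sales v5-dev (not v4-dev).
print(resolve_snapshot(chains, snapshot_ts=30))
```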

One technology used behind this is the Prolly Tree, a probabilistic B-tree that borrows its content-addressed hashing from Merkle Trees: [Image] Image from Prolly Trees in the Dolt documentation
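To give a feel for the Merkle-tree side of that structure, here is a toy sketch (not Dolt's or Tigris's implementation): leaves hash the data chunks, parents hash their children's hashes, so any change to a chunk changes the root, and two trees with equal roots are guaranteed to hold identical data.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root hash."""
    level = [h(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

a = merkle_root([b"row-1", b"row-2", b"row-3", b"row-4"])
b = merkle_root([b"row-1", b"row-2", b"row-3", b"row-4 (edited)"])
print(a.hex() != b.hex())  # True: one changed chunk changes the root
```

A Prolly Tree layers content-defined, probabilistic chunk boundaries on top of this, so a small edit only rewrites the few nodes on the path from the changed chunk to the root.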

NOTE: In a way, this is also how Software Engineers vs. Data Engineers work. Software engineers need little to no data (data can also be replaced with having unpredictable, upstream events as a dependency); they can **mock it easily** in dev (e.g., a website), and in prod they usually need big resources.

For data people, we have hard dependencies on prod data, usually heavy compute in development, and lower compute in prod. Software engineers focus on the SDLC (Software Development Lifecycle), while data engineers need to focus on the data engineering lifecycle. There are many more differences; I wrote a little more on Data Engineer vs. Software Engineer.

Data Movement Efficiency Spectrum

Before we get into the architectural decisions and the tools, let's look at the data movement involved when we implement Git for data, and categorize the approaches by how much data movement they require, ordered from most to least efficient:

The most efficient approach uses metadata/catalog-based versioning. Catalog pointers that simply reference the same files multiple times (lakeFS and Iceberg use this) create multiple logical versions of a dataset without any physical duplication. No data movement is involved.
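As a minimal sketch of the idea (hypothetical paths and names, not lakeFS's or Iceberg's actual metadata layout), a branch is just a named pointer to a list of existing files, so creating one is a metadata write rather than a data copy:

```python
# Physical files already sitting in object storage; they are never copied.
files_v1 = ["s3://lake/sales/part-000.parquet", "s3://lake/sales/part-001.parquet"]

catalog = {"main": files_v1}

def create_branch(catalog, source, new_branch):
    """Branching copies only the pointer list, not the underlying files."""
    catalog[new_branch] = list(catalog[source])

create_branch(catalog, "main", "dev-testing")
assert catalog["dev-testing"] == catalog["main"]   # same files, two logical datasets
```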

The next best approach is zero-copy or data virtualization technologies. Tools like Apache Arrow enable data sharing between processes and systems without serialization overhead. You avoid the costly conversion between formats—no deserializing from source format to an intermediate representation and back again.
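For example, with pyarrow (a small sketch; the table contents are made up), a slice shares the underlying buffers, and the Arrow IPC format writes those buffers as-is, so an Arrow-aware consumer can read them back without converting to and from another representation:

```python
import pyarrow as pa

table = pa.table({"order_id": [1, 2, 3, 4], "amount": [9.5, 12.0, 3.25, 40.0]})

# slice() is zero-copy: the new Table points at the same memory buffers.
recent = table.slice(offset=2, length=2)

# Arrow IPC writes the columnar buffers as-is, so another Arrow-aware process
# can read them back without re-parsing into a different representation.
buf = pa.BufferOutputStream()
with pa.ipc.new_stream(buf, table.schema) as writer:
    writer.write_table(table)
shared = pa.ipc.open_stream(buf.getvalue()).read_all()

print(recent.num_rows, shared.num_rows)
```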

When changes occur, delta-based approaches are the best option. Rather than copying the entire dataset, you store only what has changed in new files. If you need to roll back, you simply revert the pointer to the previous files and state while keeping the changed files. This requires some data management to track those changes.
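A toy sketch of the pointer-plus-delta idea (hypothetical file names; real table formats such as Iceberg and Delta Lake track this in manifest or log files):

```python
# Every version lists the files that make it up; new versions only add new files.
versions = {
    1: ["part-000.parquet"],
    2: ["part-000.parquet", "part-001-changes.parquet"],   # only the delta was written
}
current = 2

def rollback(to_version: int) -> int:
    """Rolling back just moves the pointer; the delta files stay on storage."""
    assert to_version in versions
    return to_version

current = rollback(1)
print(versions[current])   # reads now resolve against the old file list again
```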

The least efficient but simplest approach is full 1:1 data copying. Traditional methods like ODBC transfers, CSV exports, or database dumps require serializing data from the source format, moving it entirely, and deserializing it at the destination (e.g., from MS SQL to Pandas). Even just creating a copy on S3 while keeping the same format is an expensive operation, and more so with bigger datasets.

This works best for small datasets where the overhead doesn't matter, and offers the convenience of true isolation and easy rollback without complex change tracking.
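For completeness, a full copy in this sketch looks like the following, using the standard-library sqlite3 module plus pandas as stand-ins for an ODBC source (the database file and table name are assumptions): every row is serialized out of the source and re-materialized at the destination.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("source.db")               # stand-in for an ODBC / MS SQL source
df = pd.read_sql("SELECT * FROM sales", con)     # serializes every row into memory
df.to_parquet("sales_copy.parquet")              # re-materializes a full physical copy
```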

We can say we work from metadata → zero-copy → delta → full copy. Let's investigate how lakeFS and other tools solved that problem and which approach they have chosen.
