Git for code is very well known, not so much Git for data. Let's explore the current state of Git for data.
How Does Git Work?
To understand Git for data, we need to understand how branching with Git works, so we can apply it to data.
Git branching holds the metadata and the changes of the code at each state, and this bookkeeping is handled through hashes. But Git is not made for data: it was designed with code versioning in mind, not large binary files or datasets. As Linus Torvalds, the creator of Git, himself noted, large files were never part of the intended use case. The system's architecture of storing complete snapshots and computing hashes for everything works well for text-based code but becomes unwieldy with large data files. Yet as data practitioners, we actively want to work with data, with state, which is always harder than working with code alone.
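To make the hashing part concrete, here is a minimal Python sketch of how Git content-addresses a file: a blob's ID is the SHA-1 of a small header plus the file's bytes, so identical content always maps to the same hash and any change produces an entirely new object.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # Git blob ID = SHA-1 of the header "blob <size>\0" followed by the raw content.
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Produces the same ID that `git hash-object` would compute for a file with these bytes.
print(git_blob_hash(b"hello world\n"))
```

This content-addressing is exactly what becomes painful for large data files: touch one byte of a multi-gigabyte file and Git stores a whole new object for it.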
Git and Git-like solutions (alternatives such as Tangled and Gitea) work well. But which of their features do we want for data? And which ones do we need more of compared to versioning code?
Git has concepts like versioning, rollback, diffs, lineage, branch/merge, and sharing. On the data side, which we get into more later, we have concepts such as files vs tables, structured vs unstructured, schema vs data, branching, and time travel.
For data, we need a storage layer, or at least an approach, that is optimized for large data, schemas, and column types without necessarily duplicating the data. We also need to be able to revert code and state easily, for example to roll back the data pipelines that put production into an incorrect state.
If we look at The Struggle of Enterprise Data Integration, we can see that much of what enterprises struggle with in data is change management and managing complexity. So hopefully, Git for data can help us with exactly that.
How Does It Work with Data?
Data works differently. We need an open way of sharing and moving data that we can then version, branch off to different versions easily, and roll back to older versions.
Source: Git for Data - What, How and Why Now?
Branching is the right mental model, and it is exactly what Git does:
```
                  E---F---G   experiment-spark-3
                 /
            C---D   dev-testing
           /
main  A---B---H---I   production
           \
            J---K   hotfix-corrupted-sales
```
We start with one version, diverge into different versions, and potentially merge back. Merging different branches is one option we largely won't need for data, compared to code. With code, different features can be developed independently and then merged into the main branch at the end. With data, it's more about testing prod data on dev and then rolling out the code changes to prod, but not merging the "test" branch into the prod branch; otherwise we change, duplicate, or corrupt data.
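As a rough illustration of that workflow, here is a toy Python sketch (deliberately not any particular tool's API) in which a branch is just a named pointer to an immutable commit, and a commit is just a list of references to physical files: creating a dev branch copies no data, and deleting it after testing leaves production untouched.

```python
# Toy model of data branching: branches are pointers, commits reference physical files.
catalog = {
    "commits": {"c1": ["s3://lake/sales/part-000.parquet", "s3://lake/sales/part-001.parquet"]},
    "branches": {"production": "c1"},
}

def create_branch(name: str, source: str) -> None:
    # No files are copied; only a pointer is written.
    catalog["branches"][name] = catalog["branches"][source]

def commit(branch: str, new_files: list[str]) -> None:
    parent = catalog["branches"][branch]
    commit_id = f"c{len(catalog['commits']) + 1}"
    catalog["commits"][commit_id] = catalog["commits"][parent] + new_files
    catalog["branches"][branch] = commit_id

create_branch("dev-testing", "production")                   # instant, metadata only
commit("dev-testing", ["s3://lake/sales/part-002.parquet"])  # test run writes new files
# Validate on dev-testing, promote the *code* change to prod, then throw the branch away:
del catalog["branches"]["dev-testing"]
assert catalog["branches"]["production"] == "c1"             # production data never changed
```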
The lakeFS solution (more on how it works further down) and the Git-like features it implements:
Source: Git for Data - What, How and Why Now?
Tigris's new Fork capabilities solve some of these challenges with fractal snapshots:
You can instantly create an isolated copy of your data for development, testing, or experimentation. Have a massive production dataset you want to play with? You don't need to wait for a full copy. Just fork your source bucket, experiment freely, throw it away, and spin up a new one — instantly.
Their timelines diverge from the source bucket at the moment of the fork. It's the many-worlds version of object storage.
The key is that every object is immutable. Each write creates a new version, timestamped and preserved.
That immutability allows Tigris to version the entire bucket, and capture it as a single snapshot.

This is interesting. Rather than versioning single Delta or Iceberg tables, it versions the full bucket, building on the versioning capabilities of these open table formats. Tigris says further, "Each object maintains its own version chain, and a snapshot is an atomic cut across all those chains at a specific moment in time."
A more comprehensive example with two different tables and different isolation levels helps illustrate these processes in a data lake, with open table format tables stored on object storage:

Important to know: a snapshot is an atomically consistent version across all those chains at a specific moment in time, and when retrieving a snapshot, Tigris, for example, returns the newest version ≤ snapshot timestamp of each table. For example, Snapshot T3-dev would contain Customer Table v4-dev and only Sales Table v5-dev (not v4-dev).
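A small Python sketch of that resolution rule, assuming per-table version chains with timestamps (the chain contents below are made up to reproduce the example from the text, not taken from Tigris):

```python
# Each table keeps its own version chain of (timestamp, version label), appended in time order.
chains = {
    "customer": [(1, "v2-dev"), (2, "v3-dev"), (3, "v4-dev")],
    "sales":    [(1, "v3-dev"), (2, "v4-dev"), (3, "v5-dev")],
}

def resolve_snapshot(snapshot_ts: int) -> dict[str, str]:
    """For every table, return the newest version with timestamp <= snapshot_ts."""
    result = {}
    for table, chain in chains.items():
        eligible = [version for ts, version in chain if ts <= snapshot_ts]
        result[table] = eligible[-1]
    return result

print(resolve_snapshot(3))  # {'customer': 'v4-dev', 'sales': 'v5-dev'} -- the T3-dev snapshot
```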
One of the technologies used behind this is the Prolly Tree, a probabilistic B-tree that borrows its content-addressed hashing from Merkle trees:
Image from Prolly Trees on Dolt Documentation
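To give a feel for the idea, here is a simplified sketch (not Dolt's actual algorithm, which chooses chunk boundaries with a rolling hash rather than a fixed size): rows are grouped into chunks, each chunk is content-addressed by its hash, and the root hashes the chunk hashes. Editing one row therefore re-hashes only its own chunk and the path to the root, so unchanged chunks can be shared between versions and diffs stay cheap.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:12]  # shortened for readability

def build_tree(rows: list[str], chunk_size: int = 2):
    """Hash fixed-size chunks of rows, then hash the chunk hashes into a root."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    chunk_hashes = [h("|".join(chunk).encode()) for chunk in chunks]
    root = h("".join(chunk_hashes).encode())
    return root, chunk_hashes

rows = ["id=1,amount=10", "id=2,amount=20", "id=3,amount=30", "id=4,amount=40"]
root_a, chunks_a = build_tree(rows)

rows[3] = "id=4,amount=99"  # edit a single row
root_b, chunks_b = build_tree(rows)

print(chunks_a[0] == chunks_b[0], chunks_a[1] == chunks_b[1])  # True False: only one chunk changed
print(root_a == root_b)                                        # False: the root reflects the edit
```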
For data people, we have hard dependencies on prod data, usually heavy compute in development and lower compute in prod. Software engineers focus on the SDLC (Software Development Lifecycle), while data engineers need to focus on the data engineering lifecycle. There are many more differences. I wrote a little more on Data Engineer vs. Software Engineer.
Data Movement Efficiency Spectrum
Before we get into the architectural decisions and the tools, let's look at the data movement involved when we implement Git for data, and categorize the approaches by the amount of data movement required, ordered from most to least efficient:
The most efficient approach uses metadata/catalog-based versioning. Catalog pointers that simply point to the same files multiple times (lakeFS and Iceberg use this) create multiple logical versions of datasets without any physical duplication. No data movement is involved.
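As a rough sketch of why this is so cheap (illustrative only, not lakeFS's or Iceberg's actual on-disk layout): a new version is just a small manifest that lists already-existing files, so versioning terabytes of data costs kilobytes of metadata.

```python
import json

# Hypothetical: a thousand Parquet files that might add up to terabytes of data.
data_files = [f"s3://lake/sales/part-{i:03d}.parquet" for i in range(1000)]

# Version 2 reuses every existing file and adds one new one; nothing is rewritten or copied.
manifest_v1 = {"version": 1, "files": data_files}
manifest_v2 = {"version": 2, "files": data_files + ["s3://lake/sales/part-1000.parquet"]}

metadata_cost = len(json.dumps(manifest_v2).encode())
print(f"The new version costs ~{metadata_cost / 1024:.0f} KB of metadata and zero data movement")
```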
The next best approach is zero-copy or data virtualization technologies. Tools like Apache Arrow enable data sharing between processes and systems without serialization overhead. You avoid the costly conversion between formats—no deserializing from source format to an intermediate representation and back again.
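For example, with Apache Arrow's IPC format you can memory-map a file and read it without deserializing the data into a fresh in-memory copy. A minimal pyarrow sketch (the file name is illustrative):

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Write once in Arrow's IPC file format.
with pa.OSFile("sales.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# A consumer can memory-map the same file: reads reference the mapped buffers
# instead of parsing and copying the data into a new representation.
with pa.memory_map("sales.arrow", "r") as source:
    shared = ipc.open_file(source).read_all()

print(shared.num_rows, shared.schema.names)
```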
When changes do occur, delta-based approaches are the way to go. Rather than copying the entire dataset, you only store what has changed in new files. If you need to roll back, you simply revert the pointer to the previous file and state while keeping the changed files. This does require change tracking on the data management side.
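A toy sketch of that idea (not any specific table format's log protocol): each change lands as a new delta file, the current state is whatever the version pointer says, and rollback just moves the pointer back while the newer files stay in place.

```python
# Append-only log of added files per version; only `current_version` ever moves.
log = {
    1: {"add": ["part-000.parquet"]},      # initial load
    2: {"add": ["part-001.parquet"]},      # incremental change: only the new data is written
    3: {"add": ["part-002-bad.parquet"]},  # a bad pipeline run
}
current_version = 3

def files_at(version: int) -> list[str]:
    """Reconstruct the table's file list by replaying deltas up to `version`."""
    files: list[str] = []
    for v in sorted(log):
        if v <= version:
            files.extend(log[v]["add"])
    return files

print(files_at(current_version))  # includes the bad file
current_version = 2               # rollback: revert the pointer, keep every file on storage
print(files_at(current_version))  # ['part-000.parquet', 'part-001.parquet']
```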
The least efficient but simplest approach is full 1:1 data copying. Traditional methods like ODBC transfers, CSV exports, or database dumps require serializing data from the source format, moving it entirely, and deserializing it at the destination (e.g., from MS SQL to Pandas). But even just creating a copy on S3 while keeping the same format is an expensive operation, and more so with bigger datasets.
This works best for small datasets where the overhead doesn't matter, and offers the convenience of true isolation and easy rollback without complex change tracking.
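For completeness, a minimal pandas sketch of the full-copy path (the file name is illustrative): the dataset is serialized out in its entirety and parsed again on the other side, which is exactly the overhead the approaches above avoid, in exchange for true isolation.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Serialize the whole dataset out of the source format...
df.to_csv("sales_export.csv", index=False)

# ...and deserialize every byte of it again at the destination.
duplicate = pd.read_csv("sales_export.csv")
print(duplicate.equals(df))  # a fully isolated duplicate, paid for with a full data move
```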
We can say we work from metadata → zero-copy → delta → full copy. Let's investigate how lakeFS and other tools solve this problem and which approach they have chosen.