Git is a file system. We need a database for the code

Original link: https://gist.github.com/gritzko/6e81b5391eacb585ae207f5e634db07e

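## The future of version control: beyond Git

Software development is evolving. As LLMs take a central role, developers spend less time *writing* code and more time *understanding* and navigating existing codebases. This shift exposes Git's growing limitations, especially with the rise of collaborative development involving AI agents and large monorepos.

Git's main problems include difficulty managing code modules, non-deterministic merges, a lack of code intelligence beyond basic search, and a rigid "all or nothing" data model. Git treats code as unstructured blobs, which hinders advanced querying and understanding of changes.

The author is developing a Git replacement that gives up compatibility in favor of a fundamental architectural redesign. The new system will focus on versioning *data structures* (such as abstract syntax trees) rather than blobs, with formal, deterministic merge algorithms and a powerful, structure-aware query language: in essence, a database for code. It aims to solve long-standing problems and support better collaboration with AI, a significant leap beyond Git's foundational design.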

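The Hacker News discussion centers on the claim that Git, despite its ubiquity, is a flawed version control system, especially for modern AI-driven development and large codebases. The poster argues that Git is essentially a file system while code management needs a database, and shares a gist laying out the idea.

Commenters debate whether "Git compatibility" is necessary; many consider seamless integration with GitHub essential for adoption. Others propose "forward compatibility", which allows migration *to* a new system without replicating Git's shortcomings.

Git pain points raised include monorepo difficulties, large-file handling, and a confusing user interface. Related discussions about Git's evolution were also linked. Several projects that aim to address these problems were mentioned, including Trustfall and a project called "lit". A common view is that a better virtual file system is needed, possibly with copy-on-write, to support modern development workflows, especially those involving LLMs and AI agents.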

Original text

Software development is changing rapidly and the tool stack has yet to catch up. As we can see, the value of IDEs diminishes as developers edit code by hand less and less. More and more of the work is browsing and talking to LLMs, less and less is coding and debugging. About 8 years ago I gave a talk at an internal JetBrains conference, "Code is hypertext, IDE is a browser". Those points look even more relevant now: effective browsing of code and history is a prerequisite to effective understanding. Understanding underlies everything now. No understanding means no control, and then a developer is like a rider who fell off an LLM horse with a foot caught in the stirrup (you may search YouTube to understand what I mean).

git is increasingly becoming a point of friction. LLMs have high throughput when it comes to code edits. Sorting out the changes afterwards takes disproportionate time and often repeats work you have already done, if you actually read the diffs during the session (which I highly recommend). Even single-person development now becomes collaborative: at the very least, your collaborator is an LLM. In calm waters, running several agents at once is nothing special, and then I have an entire team, with all the merges and rebases (which we like to do beyond any measure).

That is why I think it is the right time to look for git replacements, and that is why I am working on one. I definitely reject the "git compatible" approach despite the immense gravitational pull of the existing mass of git repos. jj is to git what Subversion was to CVS. What we need is what git was to CVS: a level-up. The long-standing issues and the new ones are all rooted in the core architecture of git. Otherwise, they would have been fixed by now through gradual, incremental improvement.

The issues are:

  • The monorepo problem: git has difficulty dividing the codebase into modules and joining them back. Apart from the fact that git submodules were improvised clumsily and have haunted us ever since, the very conceptual approach to splitting and joining the code is lacking. All the Big-monorepo companies either use something else or build on top of git.

  • The split/join problem has way more implications. Suppose, for example, I want to keep my prompts and plans in a separate repo, but join them in when necessary. Or, go full JTPP - "just the prompt, please". Git has no solution for such "overlay branches", in principle. There is a source tree in git, there is a build tree somewhere else, and there is a prompt/todo/plan tree in yet another different place.

  • The merge/rebase problem: merge commits create quite a lot of friction, while rebases discard the context and imply hierarchy. Fundamentally, git merges are an act of will; they are not deterministic. Hence, a merge has to be recorded with all the ceremony. On top of that, git is not syntax-aware, so false conflicts are pretty common. Manual resolution of trivial conflicts is another source of friction.

  • Lack of any code insight features better than grep. If an SCM is a database for the code, there must be a well-developed query language. I want to see what changed in a particular function since day D, or what new uses it has gained in that time. Like git meets IDEA. An IDE/LSP gives us the spatial structure of the code; an SCM adds the temporal dimension. That is especially valuable when investigating what agents actually did.

  • Data accretion problem: once you commit things into the repo, they are tied in the Merkle graph forever. There are ways to receive only the latest version, but any general pay-as-you-go mode is lacking. git's data integrity model is blockchain-like: all or nothing. (In fact, a lot of history is trimmed by rebases, as things would be unmanageable otherwise. That is also an issue, as the actual lineage of an edit gets discarded entirely.)

  • The data model problem: git internally works with blobs, which is quite blunt. In fact, we got to the bottom of it: git is a content-addressable filesystem, not a content-addressable database.
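
To make the "blobs" point concrete, here is a minimal sketch of content addressing as git does it for blobs in SHA-1 repositories: the object id is the SHA-1 of a short header plus the raw bytes. The `BlobStore` class and the sample contents are hypothetical and only illustrate the idea; this is not how git organizes its storage on disk.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id git gives a blob with this content (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class BlobStore:
    """Toy content-addressable store (hypothetical): key = hash of the bytes."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self.objects[oid] = content  # identical content always lands on one key
        return oid

    def get(self, oid: str) -> bytes:
        return self.objects[oid]


store = BlobStore()
v1 = store.put(b"def f():\n    return 1\n")
v2 = store.put(b"def f():\n    return 2\n")

# A one-character edit yields an unrelated id; the store sees opaque bytes only.
print(v1)
print(v2)
# Storing the same bytes again returns the same id (deduplication for free).
assert store.put(b"def f():\n    return 1\n") == v1
```

The store can answer "give me the bytes for this id" and nothing else. Any question about functions, symbols, or edits has to be reconstructed outside of it, which is the filesystem-versus-database distinction drawn above.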

Overall, we need a database for the code!

Again, I have made these points at various conferences over the past 10 years, and many other people in the CRDT community have talked about "overlay branches" and "CRDT revision control" for 10-15 years. In essence, it all boils down to two things:

  1. versioning data structures, not blobs and
  2. having formal deterministic merge algorithms (associative, commutative, idempotent).
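
As an illustration of the second point, here is a minimal sketch of a merge that is associative, commutative and idempotent. It uses a last-writer-wins map, a textbook CRDT that I picked for brevity; it is not the merge algorithm of the system described here. The key names, timestamps and replica ids are made up, and the sketch assumes no two writes share the same (timestamp, replica) pair.

```python
from typing import Dict, Tuple

# (logical timestamp, replica id, value); the first two fields order the writes.
Entry = Tuple[int, str, str]
State = Dict[str, Entry]


def merge(a: State, b: State) -> State:
    """Per key, keep the entry with the greatest (timestamp, replica id)."""
    out = dict(a)
    for key, entry in b.items():
        if key not in out or entry[:2] > out[key][:2]:
            out[key] = entry
    return out


# Three hypothetical replicas that edited overlapping keys.
alice: State = {"main.py:greet": (3, "alice", "def greet(): print('hi')")}
bob: State = {
    "main.py:greet": (4, "bob", "def greet(): print('hello')"),
    "main.py:bye": (1, "bob", "def bye(): pass"),
}
carol: State = {"main.py:bye": (2, "carol", "def bye(): print('bye')")}

# Commutative: the order of the two arguments does not matter.
assert merge(alice, bob) == merge(bob, alice)
# Associative: the grouping of merges does not matter.
assert merge(merge(alice, bob), carol) == merge(alice, merge(bob, carol))
# Idempotent: merging a state with itself changes nothing.
assert merge(alice, alice) == alice

merged = merge(merge(alice, bob), carol)
```

With these three properties every replica converges to the same state no matter in which order, or how many times, it receives the others' changes. No "act of will" is needed, so there is nothing to record with ceremony.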

One approach was to represent text as a CRDT vector of letters, and it was quite popular in the field. Zed's DeltaDB aligns with that approach. I also built such systems in the past. It is safe to consider it the default. On the other hand, if we look into the internals of any JetBrains IDE or LLVM, we will see ASTs, because code has structure. If you want to treat all source code the same, you use line-based text (like all UNIX tools do). If you want to do fancy stuff, you parse the source and work with ASTs. Git is a filesystem, so it treats everything as a blob (git diff receives input blobs and reconstructs the most plausible edits algorithmically).
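
The difference between the blob view and the AST view can be sketched with Python's standard `ast` module. The example below asks "did this function change between two versions?" and ignores formatting and comments. The helper names `function_shape` and `function_changed` and both sample sources are hypothetical; this only illustrates the idea and says nothing about how the system described here stores or queries trees.

```python
import ast
from typing import Optional


def function_shape(source: str, name: str) -> Optional[str]:
    """Return a formatting-independent dump of function `name`, or None if absent."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # ast.dump omits line/column attributes by default; comments are
            # never part of the tree, so only the structure is compared.
            return ast.dump(node)
    return None


def function_changed(old: str, new: str, name: str) -> bool:
    """True if the function's syntax tree differs between the two versions."""
    return function_shape(old, name) != function_shape(new, name)


v1 = (
    "def area(w, h):\n"
    "    return w * h\n"
    "\n"
    "def perimeter(w, h):\n"
    "    return 2 * (w + h)\n"
)
v2 = (
    "def area(w,h):\n"
    "    # reformatted by an agent\n"
    "    return w*h\n"
    "\n"
    "def perimeter(w, h):\n"
    "    return 2 * w + 2 * h\n"
)

print(v1 == v2)                               # False: the blobs differ
print(function_changed(v1, v2, "area"))       # False: whitespace and comment only
print(function_changed(v1, v2, "perimeter"))  # True: the expression tree differs
```

A line-based diff would flag both functions as modified. The tree comparison reports only `perimeter`, which is the kind of answer a structure-aware query over history would need to give.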

Here I see the opportunity: a revision control system working with AST-like trees, with very formal, deterministic and reversible split/join/fork/merge semantics and a structure-aware query language. As a substrate, I use the Replicated Data eXchange format (RDX), a JSON superset with very nice CRDT merge semantics.

Part II. Inner workings of CRDT revision control.

Part III. The outer interface (no clusterfuck this time!)

Part IV. Experiments.

Part V. The Vision.
