Beagle CRDT SCM outer interface

Original link: https://gist.github.com/gritzko/9b3ac4ebb9bd8e38895d4629d0f9b151

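## Beagle SCM: a code database

Traditional SCM systems such as Git, while powerful, have grown overly complex and behave more like filesystems than code databases. That complexity slows development down, especially with the rise of AI-assisted development. Beagle aims to solve this by acting as a database: it stores abstract syntax trees (ASTs) rather than just blobs, which enables semantic querying and manipulation of code.

Beagle reduces operations to four core commands, GET, POST, PUT and DELETE, modeled on HTTP, and uses URIs for addressing. It introduces a hierarchy of repos, branches (closer to Git repositories), twigs (lighter than branches, meant for transient work) and overlays (similar to Photoshop layers, separating code, prompts and configs). Key features include deterministic, non-intrusive CRDT merges that allow safe and flexible branching and blending of code. Beagle supports advanced queries, such as searching for specific symbols or AST subtrees, that go beyond plain `grep`, which benefits developers and LLMs alike. The system favors a structured yet simpler approach to managing code, which matters as the volume of AI-generated code grows. Ultimately, Beagle aims to be a simple, reliable tool for managing code as hypertext, where the IDE acts as the browser and Beagle is the equivalent of `curl`/`wget`.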


Part I. SCM as a database for the code

The outer interface of a revision control system is often complex. In particular, git's CLI is easy to pick on because of the way it grew unconstrained into a jungle of commands, options and syntaxes with overlapping concerns. Git UI ceremony distracts and limits velocity to a very noticeable degree.

Fundamentally, SCM moves changes between worktrees and the repo, but git's multilevel system makes things look rather complex. With k types of buckets, we have k*k kinds of bucket-to-bucket moves. With remote branches different from local branches, staging and stash different from commits, plus the worktree, we get 6x6=36 kinds of potential data maneuvers. Ideally, this should be 2x2=4, worktree and repo. Is that realistic? In short, yes.
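The k*k count above can be checked with a short enumeration. This is only an illustrative sketch: the bucket labels are informal names taken from the paragraph, not git terminology, and same-bucket moves are counted as the text's 6x6 implies.

```python
from itertools import product

# Bucket types the paragraph lists for git (informal labels, assumed for illustration).
GIT_BUCKETS = ["worktree", "staging", "stash", "commits", "local branch", "remote branch"]
# The two buckets the text argues should be enough.
IDEAL_BUCKETS = ["worktree", "repo"]

def moves(buckets):
    """Enumerate every ordered (source, destination) pair, same-bucket moves included."""
    return list(product(buckets, repeat=2))

print(len(moves(GIT_BUCKETS)))    # 36
print(len(moves(IDEAL_BUCKETS)))  # 4
print(moves(IDEAL_BUCKETS))
```

The four pairs printed last line up with the four maneuvers the design later settles on: worktree to repo, repo to worktree, within the worktree, and within the repo.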

We can also look at this from the other side: what useful functions do we get at the cost of that complexity? The cost is far from trivial: the git codebase is 310 KLoC of C code and about the same amount of sh/perl/tcl. That is 15 times the size of LevelDB, a third the size of PostgreSQL, and generally in the ballpark of a general-purpose database.

Still, we cannot query the branches and the trees in any way more advanced than grep. The author's personal experience is that, with no issue tracker, branches get stale and forgotten even in solo development mode, sadly. LLMs add to that, as now there is no solo mode, and LLMs just love to reimplement things, each time very imperfectly. The awareness is lacking.

Finally, the ability to split and join content is critical in managing the mass of code. Apart from the submodule/monorepo aspect, there is the method of overlays, where we split a worktree into distinct layers (e.g. code, prompts and configs) and work with them jointly or separately, depending on circumstances. That is like Photoshop layers. This idea has circulated in the CRDT community for quite some years.

Overall, things better be more structured, but less complicated. As AIs are piling up the code, we have to keep track of it and maintain the structure.

git is a filesystem: it says so on the box, and it stores blobs. Beagle is a database for your code: it stores AST*. That makes it possible to address not only specific files for diffing/querying/merging/cherry-picking, but also specific symbols and AST* subtrees, which in turn allows for complex querying of versioned sources and text. Beagle is useful for users and LLMs alike when one has to juggle a dozen branches at a time.

The next section talks about Beagle's project/branch/twig/overlay model, which is slightly different from git's: branches are closer to git repos, twigs are like git branches but lighter, and overlays have no parallel in git at all. CRDT merges are deterministic and non-intrusive, so one can merge left and right, using the worktree as a palette for blending.

The section after that talks about Beagle's core/plumbing commands: GET, POST, PUT and DELETE. Yes, like HTTP.

Skip the next two sections if you want to see the resulting UX first. Long story short: mainly the same four commands plus URI-based syntax for everything.

Beagle SCM: repos, branches and twigs

How to make a command/ referral language flexible enough to express all the use cases by composing a minimal number of plain intuitive primitives? This problem is essentially a language problem.

With respect to addressing, Beagle bets on URIs. What worked for the World Wide Web in all its vastness should also work for intra/inter-repo referencing.

Encouraged by that idea, Beagle sets the scope of the system to global. One key feature of git was to only version an entire project as a whole. Let's think: what can we do to version an entire working system, all sources and configs, so each repo is a small GitHub hosting a number of projects?

If we want to limit ourselves to 4 basic kinds of maneuvers, those are:

1. moving changes from worktree to the repo,
2. moving changes from the repo to a worktree,
3. moving things in the worktree,
4. moving things in the repo.

We assume the current worktree is linked to one fixed place in the repo. Things look a bit too primitive so far. Then, we chalk the repo into squares:

- The repo is divided into branches, which have public identity; their names are FQDN-like, e.g. `branch.team.company.com` or `release.product.entity.org`.
- Orthogonally, the repo is divided into projects, also with public identity, e.g. `@gritzko/librdx` (like a GitHub path). So a full URI looks like `http://main.replicated.live/@gritzko/librdx`.
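As a rough illustration of how the two axes sit inside one URI, the example address above can be taken apart with Python's standard `urllib.parse`. The helper name `branch_and_project` is hypothetical, and reading the host as the branch and the path as the project is this sketch's interpretation of the text, not Beagle's actual parser.

```python
from urllib.parse import urlsplit

def branch_and_project(uri):
    """Read the host as the branch name and the path as the project name."""
    parts = urlsplit(uri)
    return parts.hostname, parts.path.lstrip("/")

branch, project = branch_and_project("http://main.replicated.live/@gritzko/librdx")
print(branch)   # main.replicated.live
print(project)  # @gritzko/librdx

# FQDN-like branch names read from the most specific label to the least specific one.
print(branch.split("."))  # ['main', 'replicated', 'live']
```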

Here the maneuver #4 gets subdivided into submaneuvers, the most frequent case being changeset exchange between branches. Note that branches are not scoped to a repo or even to a project. When we create a branch, we "fork the world". That mostly makes sense because projects form their own dependency graphs anyway, so version alpha of project A needs version beta of project B and so on. Once we create a branch, we may put in all the relevant code. With syntax-aware CRDT merge, we can be a bit bolder in forking things, as we retain enough metadata to ease merges.

On top of that, the mapping between file system paths and projects is not 1:1. First of all, one project can have several worktrees, that is normal. Second, one worktree can contain several blended branches or projects. Merging the branches is nothing special, let's talk about the other case. Suppose we want to split one project into the base and its overlays. For example, prompts, plans and TODOs live in the same dirs in the worktree, but belong to a different overlay project in the repo, @gritzko/librdx vs @gritzko/librdx.ai. We can work with the source, we can add the AI work docs, or we can deal with prompts and logs separately from sources.

The last caveat for those familiar with git (all of us) is twigs. Apart from the head, a branch can have multiple marked twigs, which are supposed to merge in the near future. The distinction here is that twigs are scoped to a project/branch/repo and have no public identity. When each developer teams up with AI, cheaper transient branching is necessary, locally and within a team. So public branches are heavier than git branches and twigs are somewhat lighter. While a twig is essentially a sticky note on a hash, CRDT merges are deterministic and non-intrusive, so merging (blending) twigs involves much less work and ceremony than merging git branches.

Plumbing: `GET POST PUT DELETE`

Back to the original question: let's see whether a URI-based referencing language and 4 HTTP verbs are sufficient to express the operations we want. GET, POST, PUT and DELETE correspond to maneuvers #2, #1, #4 and #4, respectively. Maneuver #3 is `cp`, `rm`, `vim`, etc.

- `GET http://branch.team.entity.org/project?twigA` simple checkout of a particular twig version (may need to clone first);
- `GET //branch2` switching the branch;
- `GET /project/dir/file.txt` checkout one file;
- `POST ./file.txt` stage one file (it gets imported into the repo, but the twig does not move yet);
- `DELETE somefile.txt` delete;
- `PUT ./file.txt?twigB` merge in file changes from another twig;
- `GET ?twigB` switch the twig;
- `GET ?timestamp-origin` checkout a version by its timestamp;
- `GET ?4d2130` checkout a version by its hash;
- `GET ?twigA#has(x)` list all uses of symbol `x` in twigA;
- `POST /project?twigA` commit all changes to a twig;
- `PUT //branch2?twigC` merge a twig of another branch;
- `POST ?stash; GET ?twigA` stash the changes;
- `POST ?twigA` commit changes (import, move the twig);
- `GET ?twigA#has(int,getX)` from the twig, list all AST* nodes that have children `int` and `getX` (likely the declaration and definition of `int getX()`);
- `GET //branch/project/dir#has(int,getX)` same but fancier;
- `PUT http://remote.branch.team.entity.org` big time pull;
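The references in the list above are ordinary relative URIs, so a generic URI splitter already separates them into branch, path, version and query parts. The sketch below uses Python's `urllib.parse.urlsplit`. The field names, the `resolve` helper and the rule that omitted parts are inherited from the worktree's current position are all assumptions made for illustration; the text does not specify how Beagle fills in missing parts.

```python
from urllib.parse import urlsplit

def split_ref(ref):
    """Split a Beagle-style reference into the parts the text distinguishes."""
    p = urlsplit(ref)
    return {
        "branch": p.netloc or None,   # //branch2 or http://branch.team.entity.org
        "path": p.path or None,       # /project/dir/file.txt or ./file.txt
        "version": p.query or None,   # ?twigB, ?4d2130, ?timestamp-origin
        "query": p.fragment or None,  # #has(int,getX)
    }

def resolve(ref, current):
    """Assumption: any part the reference omits is inherited from the worktree's position."""
    parts = split_ref(ref)
    return {key: parts[key] or current.get(key) for key in parts}

# Hypothetical current position of the worktree.
current = {"branch": "branch.team.entity.org", "path": "/project", "version": "twigA"}

print(resolve("?twigB", current))      # only the twig changes
print(resolve("//branch2", current))   # only the branch changes
print(split_ref("//branch/project/dir#has(int,getX)"))
```

Whether switching the branch should keep the current twig or fall back to the head is exactly the kind of detail this sketch glosses over.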

In fact, most everyday commands would break down into several GET, POST, PUT, DELETE calls. For example, refreshing the worktree also requires a temporary stash of worktree changes and their merge back into the refreshed version. Similarly, a push to a remote branch is first a POST to a local copy and then a PUT to a remote server.

While it is handy that the plumbing layer of CLI is virtually identical to the HTTP interface, for user convenience we need the "porcelain" layer doing all the everyday combos in one go.

The mission of the porcelain command layer is to let the user rely on the power of the technology while keeping him/her safe and sane.

Both plumbing and porcelain layers turn out to be quite compact so far, and most of the nuance is coded into URIs while CLI verbs only define the general maneuver. One tradeoff here is that the user must have some intuition for URI syntax. LLMs certainly have it, so no worry if you don't.

Code is hypertext, IDE is a browser. Beagle is your curl/wget, a simple reliable everyday tool.

Same as plumbing, porcelain commands implement three maneuvers:

- `get` moves data from repo to worktree,
- `post` moves data from worktree to repo,
- `put` moves data laterally in a repo.

There are some shortcuts for combos, but most of the work is get, post, put. The most straightforward linear workflow looks like:

1. `be get //branch/project` clone/checkout a worktree
2. `be come ?twig` fork off a twig (combo of `be post ?twig` + `be get ?twig`)
3. ... do some work
4. `be post` commit/stage all twig changes
5. `be put` merge in the branch head (or `be get ?head ?twig` ... `be post`, a subtly more delicate way to achieve the same result)
6. ... verify things work as intended
7. `be post ?head` merge into the head
8. ... go to step 3

Mixing branches or twigs is done by the same get verb but with multiple arguments. Use worktree as a palette where you mix and blend colors. Once satisfied, lay the paint on the canvas (post it back to the repo).

be get ?twigA ?twigB ?head
be post ?twigABH

CRDT merge never fails, technically. That does not guarantee that your worktree will build or run correctly. Semantics is entirely your (or your LLM's) responsibility. Beagle lets you merge, undo and juggle changes quickly. That is the best thing an SCM can do.
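To make "never fails" and "deterministic" concrete, here is a toy state-based CRDT: a last-writer-wins map whose merge is commutative, associative and idempotent, so a three-way blend such as `be get ?twigA ?twigB ?head` would give the same result in any order. This is not Beagle's or RDX's actual merge algorithm. The data layout, the `(timestamp, origin)` stamps and the symbol names are assumptions chosen only to illustrate the property, and stamps are assumed to be unique per write.

```python
def merge(a, b):
    """Per-key last-writer-wins: keep the entry with the larger (timestamp, origin) stamp."""
    out = dict(a)
    for key, (stamp, value) in b.items():
        if key not in out or stamp > out[key][0]:
            out[key] = (stamp, value)
    return out

# Three hypothetical versions, each mapping a symbol to (stamp, source text).
twig_a = {"getX": ((3, "alice"), "int getX();"), "getY": ((1, "alice"), "int getY();")}
twig_b = {"getX": ((2, "bob"), "long getX();"), "getZ": ((4, "bob"), "int getZ();")}
head = {"getY": ((5, "carol"), "long getY();")}

ab = merge(twig_a, twig_b)
assert ab == merge(twig_b, twig_a)                            # argument order does not matter
assert merge(ab, head) == merge(twig_a, merge(twig_b, head))  # grouping does not matter
assert merge(ab, ab) == ab                                    # re-merging changes nothing

blended = merge(ab, head)
print(sorted((k, v[1]) for k, v in blended.items()))
```

The merge always produces some result, which is the sense in which it cannot fail. Whether `long getY()` next to `int getX()` still compiles is a separate question, as the paragraph above says.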

Handy commands and aliases

There are aliases/combos for typical cases, e.g.

- `be come ?twig` make the worktree version into a twig
- `be diff` diff to the head (default, 3way)
- `be lay` make a waypoint commit (`be post ?timedate`)
- `be moan` rollback one post
- `be rate` mark the current commit
- `be fit` merge into the head (`be post ?head`)
- `be` overview of the current state (more than status)

Some shells treat ? as a special symbol, so we may skip it most of the time. There is a risk of a URI ?query being confused with a file name and other things, so in this doc ? is never skipped. Still, `be get featureA tweakB` should be OK (most of the time).
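The shell issue is that an unquoted `?` is a glob wildcard matching any single character. The sketch below uses Python's standard `fnmatch` to mimic that matching against a made-up directory listing, and `shlex.quote` to show the quoted form that a POSIX shell would pass through untouched. The file names are invented for the example.

```python
import fnmatch
import shlex

# Hypothetical files in the current directory.
files = ["atwig", "notes.txt", "xhead"]

# An unquoted ?twig is a pattern: "?" matches exactly one character.
print(fnmatch.filter(files, "?twig"))  # ['atwig']
print(fnmatch.filter(files, "?head"))  # ['xhead']

# Quoting keeps the argument literal.
print(shlex.quote("?twig"))            # '?twig'
```

So in a directory that happens to contain `atwig`, an unquoted `?twig` argument could reach the command as `atwig`.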

Beagle is balanced differently than git. There is one Beagle repo per system, Beagle branches are between git branches and git repos, while Beagle twigs are lighter than git branches (may see them as patch stacks). Approximate command equivalents:

| git | Beagle |
|---|---|
| `git init dir/` | `be post dir/` |
| `git stash push` | there is no difference between stash and any other commit, so `be post ?mystash` is enough |
| `git add a.txt b.txt` | same, `be post a.txt b.txt` |
| `git clone http://uri` | `be get http://branch.team.entity.org` |
| `git push origin a:b` | `be post http://branch` (branch names are FQDNs) |
| `git pull origin b:a` | `be get http://branch ? && be post`, where `?` is the expression for the current worktree's branch/twig formula |
| `git merge xxx` | `be get ?twigA ?twigB` |
| `git status` | `be` |

Beagle (will) implement combos for key git commands.

We started with a claim that Beagle is a database, not a filesystem. It stores a basic AST of the source code, which allows for basic code manipulation and search. That is a great opportunity to minimize busywork both for users and LLMs. That is especially valuable when digging through code written by somebody else (which is the overwhelming majority of cases, as individual contributors rely on LLMs more and more). Here are some examples of less trivial Beagle commands.

- `be get /project /project.ai` blend project and its prompt overlay (technically, a separate project)
- `be get ?head ?twig` blend head and twig (no repo changes)
- `be put ?twig#DoThing` cherry pick a symbol from a twig (will extract a patch based on the AST* tree)
- `be get ?twig#DoThing` same, but no commit, worktree only
- `be put ./file.txt?twig` cherry pick a file from a twig
- `be get ./file.txt?twig` get a file from a twig (no commit)
- `be get ./file.txt?twig#Some` cherry pick a symbol in a file
- `be put ?featureA&featureB` merge in two twigs
- `be post ?newtwig` fork
- `be post //newbranch` big time fork
- `be diff ?head#SomeClass` find any changes to SomeClass since head (prints out patches)
- `be diff ./file.txt?v1.2` find all changes to file.txt since v1.2
- `be diff #has(DoThing,int)` diff `int DoThing()` specifically
- `be get ?#todo(asan)` find things to sanitize, any twig

In fact, the semantic load on the verbs of the `be` CLI is to give the direction data moves in. We may also use a convention with no verbs at all: `be uri_dest uri_src1 uri_src2...`

That way, `be - ?twigA //branchB` is a merge into a working tree, while `be //release ?head ?tweaks` is a merge into the release branch head bypassing the working tree (reckless).

Overall, verbless use allows non-standard/advanced use patterns.
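A minimal sketch of how the verbless form could be read: the first argument is the destination, the rest are sources, and `-` stands for the working tree as in the example above. The function name, the returned fields and the rule that at least one source is required are assumptions of this sketch, not documented Beagle behavior.

```python
def parse_verbless(args):
    """Split `be uri_dest uri_src1 uri_src2...` into a destination and its sources."""
    if len(args) < 2:
        # Assumption: a destination with nothing to merge into it is an error.
        raise ValueError("need a destination and at least one source")
    dest, *sources = args
    return {
        "dest": dest,
        "sources": sources,
        "into_worktree": dest == "-",  # "-" denotes the working tree in the text's example
    }

print(parse_verbless(["-", "?twigA", "//branchB"]))
print(parse_verbless(["//release", "?head", "?tweaks"]))
```

The second call is the "reckless" case from the text: the destination is a branch, so nothing passes through the working tree.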

What Beagle internally processes is not exactly an AST but RDX, a CRDT JSON superset and tree-ish document format. Beagle employs codecs to import and export files into/from RDX. Hence, most queries have to rely on the generic document tree structure. The exact codec machinery may vary, e.g. a *.c file may be im/exported with a general text codec, a tree-sitter based codec, a clang AST based codec or a blob fallback. Changing the codec resets the file's history. Apart from the tree structure per se, codecs may tag nodes (the bit budget is rather tight there). That way, queries may distinguish a function from a class, an invocation from a declaration, and so on.

Based on that rather generic information, we can have 80/20 of your typical code navigation: callers/callees, definitions, todos, and so on. That is way more accurate than grep (how do you grep for a function body?). Still, this may fall short of full IDE capabilities. For an inquiring agent, that might be just right though.

- `mdp(worktree)` grep for markdown paragraphs (not lines)
- `grep("search")` grep-like generic search
- `has(int,getLen)` find nodes having children `int` and `getLen` (e.g. a typical C function definition)
- `fn(int,getLen)` find specifically tagged function definitions
- `use(getLen)` find uses of a symbol
- `funcs(use(getLen))` find functions using a symbol
- `files(use(getLen))` find files using a symbol
- `todo(fuzz)` search for TODOs mentioning fuzzing
- ...and so on

Query notation is RDX, although that hardly matters as it is generic enough. Each query produces a set of document elements (AST* nodes) that a command can be scoped to (diff, get, post, etc). So, for example, we can change signature of a function and commit specifically those hunks by a one-liner.
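To give a feel for what a structural query like `has(int,getLen)` selects, here is a toy version over a hand-built tree. The tree shape, the node labels and the Python representation are all invented for illustration; a real codec (tree-sitter, clang) would produce a different and richer structure, and Beagle evaluates queries over RDX rather than Python tuples.

```python
def node(label, *children):
    """A node is a (label, children) pair; leaves have an empty child list."""
    return (label, list(children))

def has(tree, *wanted):
    """Yield every node whose direct children include all the wanted labels."""
    label, children = tree
    child_labels = {child[0] for child in children}
    if set(wanted) <= child_labels:
        yield tree
    for child in children:
        yield from has(child, *wanted)

# Toy stand-in for: int getLen(); void setLen(int n); and a call to getLen().
tree = node("file",
    node("decl", node("int"), node("getLen"), node("params")),
    node("decl", node("void"), node("setLen"), node("params", node("int"), node("n"))),
    node("call", node("getLen")),
)

hits = list(has(tree, "int", "getLen"))
print([label for label, _ in hits])  # ['decl']
```

Only the first declaration matches: the call site has `getLen` but no `int`, and the parameter list of `setLen` has `int` but no `getLen`. A plain text search for `getLen` would have returned both the declaration and the call.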

If your next question is how to make this work efficiently, wait for Part III.

Acknowledgements. A. Borzilov, N. Prokopov (aka tonsky) and J. Syrowiecki contributed feedback and ideas for this draft.

Part I. SCM as a database for the code

Part III. Inner workings of CRDT revision control.

Part IV. Experiments.

Part V. The Vision.