Sem – Semantic version control. Entity-level diffs on top of Git

Original link: https://github.com/ataraxy-labs/sem

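## Sem: Semantic Version Control and Diff Analysis

Sem is a tool that augments Git with **semantic diffs**. It goes beyond line-based changes to identify *what* code changed (for example, a function was added or modified) rather than *where*. It provides entity-level diffs for 13 programming languages (TypeScript, Python, Go, Rust, Java, and others) as well as structured data formats (JSON, YAML, TOML).

**Key features:**

* **Entity recognition:** detects additions, modifications, and deletions of functions, classes, and other code entities.
* **Rename/move detection:** identifies renamed or moved entities using structural hashing and fuzzy similarity.
* **Impact analysis:** determines what a change to an entity might break.
* **Versatile usage:** works directly in a Git repository and supports staged changes, specific commits, and even stdin input.
* **JSON output:** supports integration with AI agents and CI pipelines.

Sem is available as a CLI (installable via Rust or as a prebuilt binary) and as a Rust library for integration into other tools such as weave and inspect. It uses tree-sitter for parsing and git2 for Git operations.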

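## Sem: Semantic Version Control - Hacker News Discussion

A new project called "Sem" (previously named "graft" and "got") aims to provide semantic version control on top of Git using entity-level diffs. The Hacker News discussion centered on the usefulness and definition of "semantic diff" versus "syntax-aware diff"; some questioned whether the distinction is meaningful, since compilers optimize code beyond what a simple AST comparison captures.

Users shared their experience with existing semantic diff tools for simpler languages such as YAML and JSON, and appreciated their ability to ignore irrelevant differences such as array order. Many commenters stressed the need for diffs that focus on *what changed* (functions, features) rather than *where* (line numbers), especially as AI-driven code commits become more common.

Others pointed to "difftastic" and "Beagle" as related efforts, with Beagle taking a distinctive approach of storing AST trees directly in a database. The discussion also touched on the limits of current approaches and the potential for future AI-driven diffs that abstract changes at a higher level.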

Original text

Semantic version control. Entity-level diffs on top of Git.

Instead of "line 43 changed", sem tells you "function validateToken was added in src/auth.ts".

```text
sem diff

┌─ src/auth/login.ts ──────────────────────────────────
│
│  ⊕ function  validateToken          [added]
│  ∆ function  authenticateUser       [modified]
│  ⊖ function  legacyAuth             [deleted]
│
└──────────────────────────────────────────────────────

┌─ config/database.yml ─────────────────────────────────
│
│  ∆ property  production.pool_size   [modified]
│    - 5
│    + 20
│
└──────────────────────────────────────────────────────

Summary: 1 added, 1 modified, 1 deleted across 2 files
```

Build from source (requires Rust):

```bash
git clone https://github.com/Ataraxy-Labs/sem
cd sem/crates
cargo install --path sem-cli
```

Or grab a binary from GitHub Releases.

Works in any Git repo. No setup required.

```bash
# Semantic diff of working changes
sem diff

# Staged changes only
sem diff --staged

# Specific commit
sem diff --commit abc1234

# Commit range
sem diff --from HEAD~5 --to HEAD

# JSON output (for AI agents, CI pipelines)
sem diff --format json

# Read file changes from stdin (no git repo needed)
echo '[{"filePath":"src/main.rs","status":"modified","beforeContent":"...","afterContent":"..."}]' \
  | sem diff --stdin --format json

# Only specific file types
sem diff --file-exts .py .rs

# Entity dependency graph
sem graph

# Impact analysis (what breaks if this entity changes?)
sem impact validateToken

# Entity-level blame
sem blame src/auth.ts
```
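
Hand-writing the `--stdin` payload gets awkward once file contents contain quotes or newlines. The following is a minimal Python sketch that builds the same JSON array shape as the `echo` example above. The helper name `build_stdin_payload` and the sample file contents are illustrative assumptions; only the four field names and the `"modified"` status come from the example, and other status values are not documented here.

```python
import json


def build_stdin_payload(changes):
    """Serialize (path, status, before, after) tuples into the JSON array
    shape used by the `sem diff --stdin` example (hypothetical helper)."""
    return json.dumps([
        {
            "filePath": path,
            "status": status,
            "beforeContent": before,
            "afterContent": after,
        }
        for path, status, before, after in changes
    ])


# Sample contents are made up for illustration.
payload = build_stdin_payload([
    ("src/main.rs", "modified", "fn main() {}\n", 'fn main() { println!("hi"); }\n'),
])
print(payload)

# To pass it to sem (assuming the binary is on PATH), something like:
#   subprocess.run(["sem", "diff", "--stdin", "--format", "json"],
#                  input=payload, text=True, capture_output=True)
```

Using `json.dumps` leaves the escaping of quotes and newlines inside `beforeContent` and `afterContent` to the serializer rather than to shell quoting.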

13 programming languages with full entity extraction via tree-sitter:

| Language | Extensions | Entities |
| --- | --- | --- |
| TypeScript | .ts .tsx | functions, classes, interfaces, types, enums, exports |
| JavaScript | .js .jsx .mjs .cjs | functions, classes, variables, exports |
| Python | .py | functions, classes, decorated definitions |
| Go | .go | functions, methods, types, vars, consts |
| Rust | .rs | functions, structs, enums, impls, traits, mods, consts |
| Java | .java | classes, methods, interfaces, enums, fields, constructors |
| C | .c .h | functions, structs, enums, unions, typedefs |
| C++ | .cpp .cc .hpp | functions, classes, structs, enums, namespaces, templates |
| C# | .cs | classes, methods, interfaces, enums, structs, properties |
| Ruby | .rb | methods, classes, modules |
| PHP | .php | functions, classes, methods, interfaces, traits, enums |
| Fortran | .f90 .f95 .f | functions, subroutines, modules, programs |

Plus structured data formats:

| Format | Extensions | Entities |
| --- | --- | --- |
| JSON | .json | properties, objects (RFC 6901 paths) |
| YAML | .yml .yaml | sections, properties (dot paths) |
| TOML | .toml | sections, properties |
| CSV | .csv .tsv | rows (first column as identity) |
| Markdown | .md .mdx | heading-based sections |

Everything else falls back to chunk-based diffing.
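
To make the idea of path-addressed properties concrete, here is a small Python sketch that flattens nested config data into dot paths and compares the two sides. It is an illustration of the concept, not sem's implementation; the `flatten` and `diff_properties` helpers and the `adapter` key are assumptions added for the example, and it works on already-parsed dictionaries rather than YAML text.

```python
def flatten(node, prefix=""):
    """Map every leaf value of a nested dict to its dot path."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_properties(before, after):
    """Return (path, change_type, old, new) tuples, sorted by path."""
    old, new = flatten(before), flatten(after)
    changes = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            changes.append((path, "added", None, new[path]))
        elif path not in new:
            changes.append((path, "deleted", old[path], None))
        elif old[path] != new[path]:
            changes.append((path, "modified", old[path], new[path]))
    return changes


# Mirrors the config/database.yml change in the sample output;
# the "adapter" key is hypothetical.
before = {"production": {"pool_size": 5, "adapter": "postgresql"}}
after = {"production": {"pool_size": 20, "adapter": "postgresql"}}
for change in diff_properties(before, after):
    print(change)
```

Because each property is keyed by its path rather than its line, reordering keys in the file produces no changes in this sketch.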

Three-phase entity matching:

  1. Exact ID match — same entity in before/after = modified or unchanged
  2. Structural hash match — same AST structure, different name = renamed or moved (ignores whitespace/comments)
  3. Fuzzy similarity — >80% token overlap = probable rename

This means sem detects renames and moves, not just additions and deletions. Structural hashing also distinguishes cosmetic changes (whitespace, formatting) from real logic changes.
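
The three phases can be sketched in a few dozen lines of Python. This is a rough illustration under stated assumptions, not sem's actual algorithm: it swaps tree-sitter ASTs for a crude regex tokenizer, swaps xxhash for SHA-256 from the standard library, reads "token overlap" as Jaccard overlap of token sets, and treats an exact-ID match with an identical hash as unchanged. All function names are hypothetical.

```python
import hashlib
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(body):
    """Crude tokenizer: strip '#' comments, then split into words, numbers, symbols."""
    return TOKEN.findall(re.sub(r"#.*", "", body))


def structural_hash(name, body):
    """Hash the token stream with the entity's own name masked out,
    so whitespace, comments, and the name itself do not affect it."""
    masked = ["<self>" if tok == name else tok for tok in tokens(body)]
    return hashlib.sha256(" ".join(masked).encode()).hexdigest()


def overlap(a, b):
    """Jaccard overlap of two token sets (an assumed reading of 'token overlap')."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def match_entities(before, after, threshold=0.8):
    """before/after map entity name -> source text; returns name -> change label."""
    old, new, result = dict(before), dict(after), {}

    # Phase 1: exact ID match (same name on both sides).
    for name in list(new):
        if name in old:
            same = structural_hash(name, old[name]) == structural_hash(name, new[name])
            result[name] = "unchanged" if same else "modified"
            del old[name], new[name]

    # Phase 2: structural hash match (same structure, different name).
    by_hash = {structural_hash(n, body): n for n, body in old.items()}
    for name in list(new):
        prev = by_hash.get(structural_hash(name, new[name]))
        if prev in old:
            result[name] = f"renamed from {prev}"
            del old[prev], new[name]

    # Phase 3: fuzzy similarity above the threshold.
    for name in list(new):
        best = max(old, key=lambda n: overlap(old[n], new[name]), default=None)
        if best is not None and overlap(old[best], new[name]) > threshold:
            result[name] = f"probably renamed from {best}"
            del old[best]
        else:
            result[name] = "added"
        del new[name]

    # Anything left on the old side has no counterpart.
    for name in old:
        result[name] = "deleted"
    return result


before = {
    "legacy_auth": "def legacy_auth(user):\n    return check(user.token)\n",
    "add": "def add(a, b):\n    return a + b\n",
}
after = {
    "authenticate": "def authenticate(user):\n    # same logic, new name\n    return check(user.token)\n",
    "add": "def add(a, b):\n    return a  +  b  # spacing only\n",
    "validate_token": "def validate_token(token):\n    return token is not None\n",
}
for name, label in match_entities(before, after).items():
    print(f"{name}: {label}")
```

Running the phases in this order means cheap, certain matches are consumed first, and the fuzzy comparison only sees entities that neither an ID nor a hash could pair up.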

```json
{
  "summary": {
    "fileCount": 2,
    "added": 1,
    "modified": 1,
    "deleted": 1,
    "total": 3
  },
  "changes": [
    {
      "entityId": "src/auth.ts::function::validateToken",
      "changeType": "added",
      "entityType": "function",
      "entityName": "validateToken",
      "filePath": "src/auth.ts"
    }
  ]
}
```
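
A CI step or agent can consume this output with any JSON parser. The Python sketch below groups entries by `changeType`, using only the fields that appear in the sample above; the `group_changes` helper is a hypothetical name, and the sample is embedded as a string so the snippet runs without sem installed.

```python
import json
from collections import defaultdict


def group_changes(report_text):
    """Group 'file: type name' descriptions by changeType (hypothetical helper)."""
    report = json.loads(report_text)
    grouped = defaultdict(list)
    for change in report["changes"]:
        grouped[change["changeType"]].append(
            f'{change["filePath"]}: {change["entityType"]} {change["entityName"]}'
        )
    return dict(grouped)


# The sample output shown above, embedded verbatim.
sample = """
{
  "summary": {"fileCount": 2, "added": 1, "modified": 1, "deleted": 1, "total": 3},
  "changes": [
    {
      "entityId": "src/auth.ts::function::validateToken",
      "changeType": "added",
      "entityType": "function",
      "entityName": "validateToken",
      "filePath": "src/auth.ts"
    }
  ]
}
"""

for change_type, entries in group_changes(sample).items():
    print(change_type)
    for entry in entries:
        print("  " + entry)
```

In a pipeline, the string would come from the standard output of `sem diff --format json` instead of an embedded sample.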

sem-core can be used as a Rust library dependency:

```toml
[dependencies]
sem-core = { git = "https://github.com/Ataraxy-Labs/sem", version = "0.3" }
```

Used by weave (semantic merge driver) and inspect (entity-level code review).

  • tree-sitter for code parsing (native Rust, not WASM)
  • git2 for Git operations
  • rayon for parallel file processing
  • xxhash for structural hashing
  • Plugin system for adding new languages and formats


License: MIT OR Apache-2.0
