Show HN: I ported Tree-sitter to Go

Pure-Go tree-sitter runtime — no CGo, no C toolchain, WASM-ready.

go get github.com/odvcencio/gotreesitter

Implements the same parse-table format tree-sitter uses, so existing grammars work without recompilation. In the benchmarks below it outperforms the CGo binding on every measured workload; incremental edits (the dominant operation in editors and language servers) are about 90x faster than the CGo binding.

Every existing Go tree-sitter binding requires CGo. That means:

Cross-compilation breaks (GOOS=wasip1, GOARCH=arm64 from Linux, Windows without MSYS2)
CI pipelines need a C toolchain in every build image
go install fails for end users without gcc
Race detector, fuzzing, and coverage tools work poorly across the CGo boundary

gotreesitter is pure Go. go get and build — on any target, any platform.

package main

import (
    "fmt"

    "github.com/odvcencio/gotreesitter"
    "github.com/odvcencio/gotreesitter/grammars"
)

func main() {
    src := []byte(`package main

func main() {}
`)

    lang := grammars.GoLanguage()
    parser := gotreesitter.NewParser(lang)

    tree := parser.Parse(src)
    if tree == nil {
        // Parse currently signals failure with a nil tree rather than an error.
        return
    }
    fmt.Println(tree.RootNode())

    // After editing source, reparse incrementally:
    //   tree.Edit(edit)
    //   tree2 := parser.ParseIncremental(newSrc, tree)
}

Tree-sitter's S-expression query language is supported, including predicates and cursor-based streaming. See Known Limitations for current caveats.

q, _ := gotreesitter.NewQuery(`(function_declaration name: (identifier) @fn)`, lang)
cursor := q.Exec(tree.RootNode(), lang, src)

for {
    match, ok := cursor.NextMatch()
    if !ok {
        break
    }
    for _, cap := range match.Captures {
        fmt.Println(cap.Node.Text(src))
    }
}

After the initial parse, re-parse only the changed region — unchanged subtrees are reused automatically.

// Initial parse
tree := parser.Parse(src)

// User types "x" at byte offset 42
src = append(src[:42], append([]byte("x"), src[42:]...)...)

tree.Edit(gotreesitter.InputEdit{
    StartByte:   42,
    OldEndByte:  42,
    NewEndByte:  43,
    StartPoint:  gotreesitter.Point{Row: 3, Column: 10},
    OldEndPoint: gotreesitter.Point{Row: 3, Column: 10},
    NewEndPoint: gotreesitter.Point{Row: 3, Column: 11},
})

// Incremental reparse — ~1.38 μs vs 124 μs for the CGo binding (90x faster)
tree2 := parser.ParseIncremental(src, tree)
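
The byte offsets and points in an edit have to agree with each other, and hand-written values like the ones above are easy to get wrong. The following is a minimal sketch of deriving all six fields for a pure insertion. `Point` and `Edit` here are local stand-ins that only mirror the field names of the `InputEdit` literal above (the library's own field types may differ), `pointAt` and `insertEdit` are hypothetical helpers, and columns are counted in bytes on the assumption that this port follows upstream tree-sitter's convention.

```go
package main

import (
	"bytes"
	"fmt"
)

// Point and Edit are local stand-ins mirroring the field names used in the
// InputEdit literal above. They are not the library's types.
type Point struct{ Row, Column int }

type Edit struct {
	StartByte, OldEndByte, NewEndByte    int
	StartPoint, OldEndPoint, NewEndPoint Point
}

// pointAt returns the zero-based row and byte column of offset within src.
func pointAt(src []byte, offset int) Point {
	prefix := src[:offset]
	row := bytes.Count(prefix, []byte("\n"))
	lineStart := bytes.LastIndexByte(prefix, '\n') + 1 // 0 when there is no newline
	return Point{Row: row, Column: offset - lineStart}
}

// insertEdit inserts text at offset and returns the new buffer together with
// an Edit describing the change. The input slice is left untouched.
func insertEdit(src []byte, offset int, text []byte) ([]byte, Edit) {
	out := make([]byte, 0, len(src)+len(text))
	out = append(out, src[:offset]...)
	out = append(out, text...)
	out = append(out, src[offset:]...)

	start := pointAt(src, offset)
	return out, Edit{
		StartByte:   offset,
		OldEndByte:  offset, // pure insertion: no old bytes are replaced
		NewEndByte:  offset + len(text),
		StartPoint:  start,
		OldEndPoint: start,
		NewEndPoint: pointAt(out, offset+len(text)),
	}
}

func main() {
	src := []byte("package main\n\nfunc main() {}\n")
	// Insert "x" between the braces of main.
	newSrc, e := insertEdit(src, 27, []byte("x"))
	fmt.Printf("%q\n%+v\n", newSrc, e)
}
```

Building the new buffer with a fresh allocation keeps the old source intact, which is convenient if anything still reads text through the old tree.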

Tip: Use grammars.DetectLanguage("main.go") to pick the right grammar by filename — useful for editor integration.

hl, _ := gotreesitter.NewHighlighter(lang, highlightQuery)
ranges := hl.Highlight(src)

for _, r := range ranges {
    fmt.Printf("%s: %q\n", r.Capture, src[r.StartByte:r.EndByte])
}

Note: Text predicates (#eq?, #match?, #any-of?, #not-eq?) require source []byte to evaluate. Passing nil disables predicate checks.

Extract definitions and references from source code:

entry := grammars.DetectLanguage("main.go")
lang := entry.Language()

tagger, _ := gotreesitter.NewTagger(lang, entry.TagsQuery)
tags := tagger.Tag(src)

for _, tag := range tags {
    fmt.Printf("%s %s at %d:%d\n", tag.Kind, tag.Name,
        tag.NameRange.StartPoint.Row, tag.NameRange.StartPoint.Column)
}

Each LangEntry exposes a Quality field indicating how trustworthy the parse output is:

Quality	Meaning
`full`	Token source or DFA with external scanner — full fidelity
`partial`	DFA-partial — missing external scanner, tree may have silent gaps
`none`	Cannot parse

entries := grammars.AllLanguages()
for _, e := range entries {
    fmt.Printf("%s: %s\n", e.Name, e.Quality)
}

Measured against go-tree-sitter (the standard CGo binding), parsing a Go source file with 500 function definitions.

goos: linux / goarch: amd64 / cpu: Intel(R) Core(TM) Ultra 9 285

# pure-Go parser benchmarks (root module)
go test -run '^$' -bench 'BenchmarkGoParse' -benchmem -count=3

# C baseline benchmarks (cgo_harness module)
cd cgo_harness
go test . -run '^$' -tags treesitter_c_bench -bench 'BenchmarkCTreeSitterGoParse' -benchmem -count=3

Benchmark	ns/op	B/op	allocs/op
`BenchmarkCTreeSitterGoParseFull`	2,058,000	600	6
`BenchmarkCTreeSitterGoParseIncrementalSingleByteEdit`	124,100	648	7
`BenchmarkCTreeSitterGoParseIncrementalNoEdit`	121,100	600	6
`BenchmarkGoParseFull`	1,330,000	10,842	2,495
`BenchmarkGoParseIncrementalSingleByteEdit`	1,381	361	9
`BenchmarkGoParseIncrementalNoEdit`	8.63	0	0

Summary:

Workload	gotreesitter	CGo binding	Ratio
Full parse	1,330 μs	2,058 μs	~1.5x faster
Incremental (single-byte edit)	1.38 μs	124 μs	~90x faster
Incremental (no-op reparse)	8.6 ns	121 μs	~14,000x faster

The incremental hot path reuses subtrees aggressively — a single-byte edit reparses in microseconds while the CGo binding pays full C-runtime and call overhead. The no-edit fast path exits on a single nil-check: zero allocations, single-digit nanoseconds.
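
The ratios in the summary follow directly from the ns/op columns. As a quick check, this small program recomputes them from the benchmark figures quoted above; `speedup` is just a helper name for the division.

```go
package main

import "fmt"

// speedup returns how many times faster the candidate is than the baseline,
// given ns/op for each.
func speedup(baselineNs, candidateNs float64) float64 {
	return baselineNs / candidateNs
}

func main() {
	rows := []struct {
		name          string
		cgoNs, pureNs float64 // ns/op from the benchmark table
	}{
		{"full parse", 2_058_000, 1_330_000},
		{"single-byte edit", 124_100, 1_381},
		{"no-op reparse", 121_100, 8.63},
	}
	for _, r := range rows {
		fmt.Printf("%-18s %.1fx\n", r.name, speedup(r.cgoNs, r.pureNs))
	}
}
```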

205 grammars ship in the registry. Run go run ./cmd/parity_report for live per-language status.

Current summary:

204 full — parse without errors (token source or DFA with complete external scanner)
1 partial — norg (requires external scanner with 122 tokens, not yet implemented)
0 unsupported

Backend breakdown:

195 dfa — DFA lexer with hand-written Go external scanner where needed
1 dfa-partial — generated DFA without external scanner (norg)
9 token_source — hand-written pure-Go lexer bridge (authzed, c, go, html, java, json, lua, toml, yaml)

111 languages have hand-written Go external scanners attached via zzz_scanner_attachments.go.

Full language list (205): ada, agda, angular, apex, arduino, asm, astro, authzed, awk, bash, bass, beancount, bibtex, bicep, bitbake, blade, brightscript, c, c_sharp, caddy, cairo, capnp, chatito, circom, clojure, cmake, cobol, comment, commonlisp, cooklang, corn, cpon, cpp, crystal, css, csv, cuda, cue, cylc, d, dart, desktop, devicetree, dhall, diff, disassembly, djot, dockerfile, dot, doxygen, dtd, earthfile, ebnf, editorconfig, eds, eex, elisp, elixir, elm, elsa, embedded_template, enforce, erlang, facility, faust, fennel, fidl, firrtl, fish, foam, forth, fortran, fsharp, gdscript, git_config, git_rebase, gitattributes, gitcommit, gitignore, gleam, glsl, gn, go, godot_resource, gomod, graphql, groovy, hack, hare, haskell, haxe, hcl, heex, hlsl, html, http, hurl, hyprlang, ini, janet, java, javascript, jinja2, jq, jsdoc, json, json5, jsonnet, julia, just, kconfig, kdl, kotlin, ledger, less, linkerscript, liquid, llvm, lua, luau, make, markdown, markdown_inline, matlab, mermaid, meson, mojo, move, nginx, nickel, nim, ninja, nix, norg, nushell, objc, ocaml, odin, org, pascal, pem, perl, php, pkl, powershell, prisma, prolog, promql, properties, proto, pug, puppet, purescript, python, ql, r, racket, regex, rego, requirements, rescript, robot, ron, rst, ruby, rust, scala, scheme, scss, smithy, solidity, sparql, sql, squirrel, ssh_config, starlark, svelte, swift, tablegen, tcl, teal, templ, textproto, thrift, tlaplus, tmux, todotxt, toml, tsx, turtle, twig, typescript, typst, uxntal, v, verilog, vhdl, vimdoc, vue, wgsl, wolfram, xml, yaml, yuck, zig

Feature	Status
Compile + execute (`NewQuery`, `Execute`, `ExecuteNode`)	supported
Cursor streaming (`Exec`, `NextMatch`, `NextCapture`)	supported
Structural quantifiers (`?`, `*`, `+`)	supported
Alternation (`[...]`)	supported
Field matching (`name: (identifier)`)	supported
`#eq?` / `#not-eq?`	supported
`#match?` / `#not-match?`	supported
`#any-of?` / `#not-any-of?`	supported
`#lua-match?`	supported
`#has-ancestor?` / `#not-has-ancestor?`	supported
`#not-has-parent?`	supported
`#is?` / `#is-not?`	supported
`#set!` / `#offset!` directives	parsed and accepted

As of February 23, 2026, all shipped highlight and tags queries compile in this repo (156/156 non-empty HighlightQuery entries, 69/69 non-empty TagsQuery entries).

No known query-syntax gaps currently block shipped highlight or tags queries.

One language, norg, requires an external scanner that has not yet been ported to Go. It parses with the DFA lexer alone, so tokens that depend on the external scanner are silently skipped: the tree is structurally valid but may have gaps. Check entry.Quality to distinguish full from partial.

1. Add the grammar to grammars/languages.manifest.

2. Generate bindings:

go run ./cmd/ts2go -manifest grammars/languages.manifest -outdir ./grammars -package grammars -compact=true

This regenerates grammars/embedded_grammars_gen.go, grammars/grammar_blobs/*.bin, and language register stubs.

3. Add smoke samples to cmd/parity_report/main.go and grammars/parse_support_test.go.

4. Verify:

go run ./cmd/parity_report
go test ./grammars/...

gotreesitter reimplements the tree-sitter runtime in pure Go:

Parser — table-driven LR(1) with GLR support for ambiguous grammars
Incremental reuse — cursor-based subtree reuse; unchanged regions skip reparsing entirely
Arena allocator — slab-based node allocation with ref counting, minimizing GC pressure
DFA lexer — generated from grammar tables via ts2go, with hand-written bridges where needed
External scanner VM — bytecode interpreter for language-specific scanning (Python indentation, etc.)
Query engine — S-expression pattern matching with predicate evaluation and streaming cursors
Highlighter — query-based syntax highlighting with incremental support
Tagger — symbol definition/reference extraction using tags queries
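
For readers unfamiliar with slab allocation, here is a minimal sketch of the general technique in Go. It is not gotreesitter's arena: the `node` layout, the `arena` type, and the slab size are all made up for illustration, and ref counting is left out. The point is that nodes come from a few large slices instead of one heap object each, so the garbage collector has far fewer objects to track.

```go
package main

import "fmt"

// node is a stand-in for a syntax-tree node; the real layout is not shown here.
type node struct {
	symbol     uint16
	start, end uint32
}

// arena hands out nodes from fixed-size slabs. A slab is never resized,
// so pointers into it stay valid for the arena's lifetime.
type arena struct {
	slabSize int
	slabs    [][]node
	used     int // nodes handed out from the newest slab
}

func newArena(slabSize int) *arena {
	return &arena{slabSize: slabSize}
}

// alloc returns a pointer to a zeroed node, adding a new slab when the
// current one is full.
func (a *arena) alloc() *node {
	if len(a.slabs) == 0 || a.used == a.slabSize {
		a.slabs = append(a.slabs, make([]node, a.slabSize))
		a.used = 0
	}
	n := &a.slabs[len(a.slabs)-1][a.used]
	a.used++
	return n
}

func main() {
	a := newArena(1024)
	for i := 0; i < 3000; i++ {
		n := a.alloc()
		n.symbol = uint16(i % 100)
	}
	// 3000 nodes fit in 3 slabs of 1024, so only 3 slice allocations occurred.
	fmt.Println(len(a.slabs))
}
```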

Grammar tables are extracted from upstream tree-sitter parser.c files by the ts2go tool, serialized into compressed binary blobs, and lazy-loaded on first language use. No C code runs at parse time.
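
Lazy loading of this kind is commonly built on `sync.Once`. The sketch below shows that pattern only, under the assumption that a grammar is decoded by some function on first access. `lazyGrammar` and its `decode` field are hypothetical and do not reflect the actual loader.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyGrammar decodes its table at most once, on first use.
type lazyGrammar struct {
	name   string
	once   sync.Once
	table  []uint16
	decode func() []uint16 // hypothetical decoder for the serialized blob
}

// Table returns the decoded table, decoding it on the first call only.
// It is safe to call from multiple goroutines.
func (g *lazyGrammar) Table() []uint16 {
	g.once.Do(func() { g.table = g.decode() })
	return g.table
}

func main() {
	calls := 0
	g := &lazyGrammar{name: "go", decode: func() []uint16 {
		calls++
		return []uint16{1, 2, 3}
	}}

	fmt.Println(calls) // 0: nothing is decoded until the grammar is used
	g.Table()
	g.Table()
	fmt.Println(calls) // 1: the second call reuses the decoded table
}
```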

To avoid embedding blobs into the binary, build with -tags grammar_blobs_external and set GOTREESITTER_GRAMMAR_BLOB_DIR to a directory containing *.bin grammar blobs. External blob mode uses mmap on Unix by default (GOTREESITTER_GRAMMAR_BLOB_MMAP=false to disable).

To ship a smaller embedded binary with a curated language set, build with -tags grammar_set_core (core set includes common languages like c, go, java, javascript, python, rust, typescript, etc.).

To restrict registered languages at runtime (embedded or external), set:

GOTREESITTER_GRAMMAR_SET=go,json,python

For long-lived processes, grammar cache memory is tunable:

// Keep only the 8 most recently used decoded grammars in cache.
grammars.SetEmbeddedLanguageCacheLimit(8)

// Drop one language blob from cache (e.g. "rust.bin").
grammars.UnloadEmbeddedLanguage("rust.bin")

// Drop all decoded grammars from cache.
grammars.PurgeEmbeddedLanguageCache()

You can also set GOTREESITTER_GRAMMAR_CACHE_LIMIT at process start to apply a cache cap without code changes. Set it to 0 only when you explicitly want no retention (each grammar access will decode again).
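
To make the cache limit concrete, here is a small least-recently-used cache in plain Go. It sketches the eviction policy that "keep only the N most recently used" describes and is not the library's cache; `lruCache`, `newLRU`, `Get`, and `Put` are illustrative names. A limit of 0 retains nothing, matching the behavior described above.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value []byte
}

// lruCache keeps at most limit entries, evicting the least recently used.
type lruCache struct {
	limit int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newLRU(limit int) *lruCache {
	return &lruCache{limit: limit, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the cached value and marks the key as recently used.
func (c *lruCache) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put stores a value, evicting the oldest entry if the limit is exceeded.
func (c *lruCache) Put(key string, value []byte) {
	if c.limit <= 0 {
		return // limit 0: retain nothing, every access decodes again
	}
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.limit {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Put("go.bin", []byte("go tables"))
	c.Put("rust.bin", []byte("rust tables"))
	c.Get("go.bin")                          // go.bin is now the most recently used
	c.Put("json.bin", []byte("json tables")) // evicts rust.bin

	_, ok := c.Get("rust.bin")
	fmt.Println(ok) // false
}
```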

Idle eviction can be enabled with env vars:

GOTREESITTER_GRAMMAR_IDLE_TTL=5m
GOTREESITTER_GRAMMAR_IDLE_SWEEP=30s

Loader compaction/interning is enabled by default and tunable via:

GOTREESITTER_GRAMMAR_COMPACT=true
GOTREESITTER_GRAMMAR_STRING_INTERN_LIMIT=200000
GOTREESITTER_GRAMMAR_TRANSITION_INTERN_LIMIT=20000

The test suite includes:

Smoke tests — all 205 grammars parse a sample without crashing or producing ERROR nodes
Correctness snapshots — golden S-expression tests for 20 core languages catch parser and grammar regressions
Highlight validation — end-to-end test that compiled highlight queries produce highlight ranges
Query tests — pattern matching, predicates, cursors, field-based matching
Parser tests — incremental reparsing, error recovery, GLR ambiguity resolution
Fuzzing — FuzzGoParseDoesNotPanic for parser robustness

go test ./... -race -count=1

Current: v0.4.0 — 205 grammars, stable parser, incremental reparsing, query engine, highlighting, tagging.

Next:

Query engine parity hardening — field-negation semantics, metadata directive behavior, and additional edge-case parity with upstream tree-sitter query execution
More hand-written external scanners for high-value dfa-partial languages
Parse() (*Tree, error) — return errors instead of silent nil trees
Automated parity testing against the C tree-sitter output
Fuzzing expansion to cover more languages and the query engine

MIT
