交错增量
Interleaved Deltas

原始链接: https://mmapped.blog/posts/51-interleaved-deltas

本摘要探讨了“编织”(weaves)这一历史悠久的数据结构,它曾被用于源代码控制系统(SCCS)以管理文本版本。编织通过管理交错的增量(deltas),将其视为一系列指令(插入、删除或保留)来重构文件修订。 与 Git 等现代快照系统不同,编织将差异存储为交错的数据块。由于这些数据块可能重叠,重构特定版本需要一个“激活集”(activation set)——即一组决定哪些行可见的增量开关。通过追踪这些激活集,可以恢复任何版本并计算它们之间的变更。 尽管编织已被 Git 取代,但它在今天仍具有参考意义。它能够表示细粒度的历史变更,这使其成为现代分布式版本控制系统以及 Pijul 和 CRDT 等冲突解决系统的概念先驱。作者通过提供一个 Go 语言实现展示了这一点,该实现能够重构版本、使用最长公共子序列(LCS)算法计算差异,并将新变更交织进现有的编织中。归根结底,编织证明了简洁而稳健的设计在解决软件演进复杂问题时的力量。

抱歉。
相关文章

原文


A weave has a reputation of one of the most reinvented data structures in history.

Victor Grishchenko, Mikhail Patrakeev, Chronofold: a data structure for versioned text

Progress is often viewed as a transition from simple to complex. Although individual systems do become more complex over time, long-term progress happens when new ideas and advances in hardware make simple designs feasible.

Think of the simplest version control system you can imagine. It probably stores project snapshots in a backup directory. Maybe it compresses files for efficiency. Add hashing for content addressing, and the sketch starts to resemble Git. I don’t mean this as an insult to Git; quite the opposite:

the structure of a Git repository is stunningly simple. It’s so simple in fact, that you’ll be surprised at the sophistication of the things you can do with it. But that’s the way the best code will always be: simple, solid premises out of which complex applications arise.

Wincent Colaiuta, A look back: Bram Cohen vs Linus Torvalds

The Source Code Control System (sccs), developed by Marc J. Rochkind in the early seventies, had to be more sophisticated. It operated on one file at a time and stored all its revisions in a data structure called interleaved deltas or weaves.

If Git felt like something I could have invented myself, weaves captivated and intimidated me for years. When I finally found time to work out the algorithms, the resources were scarce

A weave is a sequence of instructions for reconstructing file revisions. An instruction consists of a type and an operand.

Type Operand Meaning
Line Line index Output a line.
BeginInsert Version number Start a block of lines added in a version.
BeginDelete Version number Start a block of lines deleted in a version.
End Version number Close a block for a version.
The weave instruction type.
type InstructionType int

const (
	Line InstructionType = iota
	BeginInsert
	BeginDelete
	End
)


type Instruction struct {
	Type InstructionType
	
	
	
	Payload int
}

func (i Instruction) Line() int {
	return i.Payload
}

func (i Instruction) VersionID() VersionID {
	return VersionID(i.Payload)
}

All line instructions must belong to an Insert or Delete block. The integer in the line instruction points into the global line pool. Using line indices instead of strings shrinks the weave and makes instructions uniform.

Unlike in xml or json, weave blocks can overlap. For example, if v1 adds lines 1 and 2, v2 appends lines 3–6, and v3 removes lines 2–4, then v3’s delta overlaps with deltas of the first two versions.

   ⎧ Insert v1
   ⎪ 1
Δ1 ⎨ Delete v3  ⎫
   ⎪ 2          ⎪
   ⎩ End v1     ⎪
   ⎧ Insert v2  ⎬ Δ3
   ⎪ 3          ⎪
   ⎪ 4          ⎪
Δ2 ⎨ End v3     ⎭
   ⎪ 5
   ⎪ 6
   ⎩ End v2

Versions in a weave depend on one another. In the example above, v3 depends on both v1 and v2 because it deletes lines that they introduced. The relationship between v1 and v2 is ambiguous: v1 could be v2’s parent, or they could be independent branches that v3 merged.

The weave alone is not enough to deduce dependencies between deltas. Most weave-based systems express these dependencies using activation sets.

sccs has hierarchical revision names (1.4 and 2.1, for example). When you check out a revision, the system uses heuristics to compute its activation set—the collection of deltas that contribute to the file content.

I think of activation sets as switches controlling streetlamps on a dark avenue. Flipping them reveals or hides parts of the weave.

For my experiment, I chose flat sequential revision numbering. Each revision can have one or more parents

An active set is a collection of deltas that together make a file revision. We compute it by traversing the version graph.
type VersionID int

type ActiveSet []bool

func (s ActiveSet) Contains(v VersionID) bool {
	return v >= 0 && int(v) < len(s) && s[v]
}

func (s ActiveSet) Deactivate(v VersionID) ActiveSet {
	out := slices.Clone(s)
	out[v] = false
	return out
}

type Version struct {
	ID          VersionID
	Parents     []VersionID
	Author      string
	Description string
	Date        time.Time
}

func activeSetForVersion(versions []Version, v VersionID) ActiveSet {
	activeSet := make(ActiveSet, len(versions)+1)
	stack := []VersionID{v}
	for len(stack) > 0 {
		w := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		activeSet[w] = true
		for _, p := range versions[w-1].Parents {
			if !activeSet[p] {
				stack = append(stack, p)
			}
		}
	}
	return activeSet
}

We’re now ready to tackle the version reconstruction algorithm.

To reconstruct a revision, we compute its active set and scan the instruction list, keeping track of currently open blocks in a priority queue. If the instruction opens an insert block—for any revision—or a delete block for an active revision, we push it onto the queue. We copy lines if the block at the top of the priority queue is active (that’s why we keep inactive inserts in the queue).

We need a priority queue because blocks can overlap (see the Weave structure section for an example). If blocks were properly nested, a stack would suffice.

My implementation uses a sorted slice as a priority queue for brevity (heaps require a lot of boilerplate in Go).

The Reconstruct function recovers the lines of any file version stored in a weave.
var (
	ErrBadWeave = errors.New("E001: malformed weave")
	ErrBadDelta = errors.New("E002: bad delta")
)






func Reconstruct(
	instructions []Instruction,
	activeSet ActiveSet,
) (mask []bool, versions []VersionID, err error) {
	mask = make([]bool, len(instructions))
	openBlocksByVersion := make(map[VersionID]int)
	
	var activeBlocks []Instruction
	activeBlock := func(v VersionID) (int, bool) {
		return slices.BinarySearchFunc(activeBlocks, v,
			func(instr Instruction, target VersionID) int {
				return cmp.Compare(instr.VersionID(), target)
			})
	}

	for offset, instr := range instructions {
		switch instr.Type {
		case Line:
			if len(activeBlocks) == 0 {
				return nil, nil, ErrBadWeave
			}
			top := activeBlocks[len(activeBlocks)-1]
			v := top.VersionID()
			if activeSet.Contains(v) && top.Type == BeginInsert {
				mask[offset] = true
				versions = append(versions, v)
			}
		case BeginInsert, BeginDelete:
			v := instr.VersionID()
			if _, ok := openBlocksByVersion[v]; ok {
				return nil, nil, ErrBadWeave
			}
			if instr.Type == BeginInsert || activeSet.Contains(v) {
				at, _ := activeBlock(v)
				activeBlocks = slices.Insert(activeBlocks, at, instr)
			}
			openBlocksByVersion[v] = offset
		case End:
			v := instr.VersionID()
			if _, ok := openBlocksByVersion[v]; !ok {
				return nil, nil, ErrBadWeave
			}
			at, found := activeBlock(v)
			if found {
				activeBlocks = slices.Delete(activeBlocks, at, at+1)
			}
			delete(openBlocksByVersion, v)
		default:
			return nil, nil, ErrBadWeave
		}
	}
	if len(openBlocksByVersion) > 0 {
		return nil, nil, ErrBadWeave
	}
	return mask, versions, nil
}

Returning the bit mask of enabled line instructions (instead of their contents) simplifies many operations, including weave extension and delta reconstruction.

Before we can start adding changes to weaves, we need diffs.

Given two sequences of items, we want to find a transformation that turns the first sequence into the second and uses the fewest steps.

The transformation is a sequence of deltas (also known as hunks) that specify an action (Insert, Delete, or Keep) and the items it acts upon. I called such a transformation a diff script.

The diff algorithm I’m most familiar with is the lcs algorithm

  1. Use dynamic programming to build the lcs table. The lcs[i][j] entry is the length of a longest common subsequence of a[i:] and b[j:]. Fill the table back-to-front in both dimensions.
  2. Traverse the lcs table to recover the actions, moving toward lcs increase when a and b don’t align.
The DiffScript function computes an optimal edit script for two sequences of integers.
type Action int

const (
	Insert Action = iota
	Delete
	Keep
)

type Delta struct {
	Action Action
	Items  []int
}

func appendDelta(script []Delta, action Action, item int) []Delta {
	n := len(script)
	if n > 0 && script[n-1].Action == action {
		script[n-1].Items = append(script[n-1].Items, item)
		return script
	}
	return append(script, Delta{Action: action, Items: []int{item}})
}


func DiffScript(a, b []int) (script []Delta) {
	n, m := len(a), len(b)

	
	lcs := make([][]int, n+1)
	for i := range lcs {
		lcs[i] = make([]int, m+1)
	}
	for i := range slices.Backward(a) {
		for j := range slices.Backward(b) {
			if a[i] == b[j] {
				lcs[i][j] = 1 + lcs[i+1][j+1]
			} else {
				lcs[i][j] = max(lcs[i+1][j], lcs[i][j+1])
			}
		}
	}

	
	for i, j := 0, 0; i < n || j < m; {
		switch {
		case i < n && j < m && a[i] == b[j]:
			script = appendDelta(script, Keep, a[i])
			i++
			j++
		case i < n && (j == m || lcs[i+1][j] >= lcs[i][j+1]):
			script = appendDelta(script, Delete, a[i])
			i++
		default:
			script = appendDelta(script, Insert, b[j])
			j++
		}
	}

	return script
}

A practical system would use Myers diff or Bram Cohen’s patience diff. James Coglan, the author of Building Git, wrote a series of blog posts explaining Myers diff and patience diff in great detail.

Now that we have deltas, let’s add them to a weave.

The Interleave function takes a weave and a version within that weave (identified by an active set) and incorporates new deltas, producing a new weave.

It calls the Reconstruct function to annotate each active line in the weave and then traverses the instructions and deltas in parallel, adding new instructions that match delta actions. The parallel loop body is the following procedure:

  1. Copy unmarked instructions to the output as is.
  2. Upon finding the first marked line, stop and consult the delta action.
  3. If the action is Insert, append its items as lines to the output, surrounding them by a BeginInsert/End block.
  4. If the action is Delete, open a BeginDelete block, copy the input stream as long as its active lines match the delta items, then close the block.
  5. If the action is Keep, copy the input until all items have appeared as active lines.
The Interleave function adds new deltas to a weave.




func Interleave(
	instructions []Instruction,
	activeSet ActiveSet,
	deltas []Delta,
	v VersionID,
) (out []Instruction, err error) {
	mask, _, err := Reconstruct(instructions, activeSet)
	if err != nil {
		return nil, err
	}

	i, j, n, m := 0, 0, len(instructions), len(deltas)

	
	
	consumeLines := func(expected []int) error {
		skipped := 0
		for skipped < len(expected) {
			if i >= n {
				return ErrBadDelta
			}
			out = append(out, instructions[i])
			if mask[i] {
				if expected[skipped] != instructions[i].Line() {
					return ErrBadDelta
				}
				skipped++
			}
			i++
		}
		return nil
	}

	for i < n || j < m {
		for i < n && !mask[i] { 
			out = append(out, instructions[i])
			i++
		}
		if j < m {
			delta := deltas[j]
			switch delta.Action { 
			case Insert: 
				out = append(out, Instruction{BeginInsert, int(v)})
				for _, item := range delta.Items {
					out = append(out, Instruction{Line, item})
				}
				out = append(out, Instruction{End, int(v)})
			case Delete: 
				out = append(out, Instruction{BeginDelete, int(v)})
				if err := consumeLines(delta.Items); err != nil {
					return nil, err
				}
				out = append(out, Instruction{End, int(v)})
			case Keep: 
				if err := consumeLines(delta.Items); err != nil {
					return nil, err
				}
			}
			j++
		} else if i < n {
			return nil, ErrBadDelta
		}
	}
	return out, nil
}

Now, we can recover any version stored in a weave and detect and apply changes. The last missing piece is extracting deltas.

How do we know what changed in version v? We could recover v and its predecessor and diff the outputs, but it would be wasteful, given that the weave already stores deltas. A more direct approach is to use the bit masks returned by the Reconstruct function to detect lines that become inactive if we deactivate v.

The ReconstructDelta function diffs two active revision sets.

func ReconstructDelta(
	instructions []Instruction,
	beforeSet, afterSet ActiveSet,
) (deltas []Delta, err error) {
	before, _, err := Reconstruct(instructions, beforeSet)
	if err != nil {
		return nil, err
	}
	after, _, err := Reconstruct(instructions, afterSet)
	if err != nil {
		return nil, err
	}

	for i, instr := range instructions {
		if instr.Type != Line {
			continue
		}
		switch {
		case before[i] && after[i]:
			deltas = appendDelta(deltas, Keep, instr.Line())
		case !before[i] && after[i]:
			deltas = appendDelta(deltas, Insert, instr.Line())
		case before[i] && !after[i]:
			deltas = appendDelta(deltas, Delete, instr.Line())
		}
	}
	return deltas, nil
}

ReconstructDelta can compute deltas between any two versions. If we need the delta of a particular version v, we diff ActiveSet(v){v} and ActiveSet(v).

vSet := activeSetForVersion(versions, v)
vDelta, err := ReconstructDelta(instructions, vSet.Deactivate(v), vSet)

Now we have enough machinery to build a simple version control system that can support most of sccs’s features.

Nobody in their right mind would use sccs today, yet it shaped modern software as few systems did. Marc J. Rochkind’s paper won the first icse Most Influential Paper Award in 1989, but as David Parnas noted

Git is ubiquitous today, and its connection to sccs is more than ideological. Git borrowed many ideas from BitKeeper, a distributed version control system that powered Linux kernel development between 2002 and 2005. BitKeeper, in turn, was a direct descendant of sccs and used weaves to track changes.

Conflict-free Replicated Data Types and the Pijul version control system employ data structures that are strikingly similar to weaves because their history representation makes conflicts easier to resolve

Weaves always return.

  1. Explain why Reconstruct pushes inactive BeginInsert instructions onto the queue but ignores inactive BeginDelete instructions.
  2. Assemble weave algorithms into a simple single-file version control system.
  3. Extend the system to detect history file corruptions.
  4. Implement two-way merge for any two versions.
  5. Extend the system to handle repositories, not just single files.
联系我们 contact @ memedata.com