将 Unix "find" 表达式编译成字节码
Unix "find" expressions compiled to bytecode

原始链接: https://nullprogram.com/blog/2025/12/23/

## 查找编译器总结 本文详细介绍了作者为 Unix `find` 实用程序开发编译器的过程,其动机来自于未来的项目以及对常用实现效率低下的惊讶。现有的 `find` 实现通常依赖于树形遍历解释器,而作者选择使用字节码编译器来提高性能,通过预解析和最小化每个文件的操作来实现。 `find` 表达式语言,利用诸如 `-a` (与)、`-o` (或) 和 `!` (非) 等运算符,被翻译成简单的五条指令的字节码。该字节码作为一个单寄存器机器运行,使用分支指令 (`braf`/`brat`) 来实现逻辑运算。 作者的编译器 `findc` 将 `find` 的中缀表示法转换为后缀表示法,使用 Shunting-yard 算法,然后通过在栈上构建代码片段将其转换为字节码。生成的字节码虽然功能正常,但展示了优化的机会——例如消除冗余的 `NOT` 指令和简化分支——可以通过窥孔优化器来实现。提供了一个 Wasm 演示和源代码,允许用户试验 `find` 命令并观察它们的字节码表示。

## Hacker News 讨论:`find` 命令性能 一篇 Hacker News 讨论围绕着 Unix `find` 命令通常使用简单的树形遍历解释器,而不是将表达式编译成字节码的原因。最初的帖子表达了对这种实现选择的惊讶。 普遍的共识是,`find` 的性能主要受限于磁盘 I/O,使得表达式求值的优化在很大程度上无关紧要。即使使用 RAM 磁盘等快速存储,大量的文件和目录仍然可能使 I/O 成为瓶颈。 虽然字节码方法*可能*提供诸如更易于测试和潜在降低 CPU 使用率之类的优势,但与文件系统操作所花费的时间相比,收益可能微乎其微。 许多评论员指出数据库系统中的类似方法,即使是成熟的数据库,由于 I/O 的主导地位,也经常优先使用树形遍历进行表达式求值。 优化集中在 I/O 和复杂计算上,而不是简单的文件过滤。 最终,简单性和易于实现是 `find` 设计的关键因素,因为解释器直接反映了搜索表达式的树形结构。
相关文章

原文

nullprogram.com/blog/2025/12/23/

In preparation for a future project, I was thinking about at the unix find utility. It operates a file system hierarchies, with basic operations selected and filtered using a specialized expression language. Users compose operations using unary and binary operators, grouping with parentheses for precedence. find may apply the expression to a great many files, so compiling it into a bytecode, resolving as much as possible ahead of time, and minimizing the per-element work, seems like a prudent implementation strategy. With some thought, I worked out a technique to do so, which was simpler than I expected, and I’m pleased with the results. I was later surprised all the real world find implementations I examined use tree-walk interpreters instead. This article describes how my compiler works, with a runnable example, and lists ideas for improvements.

For a quick overview, the syntax looks like this:

$ find [-H|-L] path... [expression...]

Technically at least one path is required, but most implementations imply . when none are provided. If no expression is supplied, the default is -print, e.g. print everything under each listed path. This prints the whole tree, including directories, under the current directory:

To only print files, we could use -type f:

$ find . -type f -a -print

Where -a is the logical AND binary operator. -print always evaluates to true. It’s never necessary to write -a, and adjacent operations are implicitly joined with -a. We can keep chaining them, such as finding all executable files:

$ find . -type f -executable -print

If no -exec, -ok, or -print (or similar side-effect extensions like -print0 or -delete) are present, the whole expression is wrapped in an implicit ( expr ) -print. So we could also write this:

$ find . -type f -executable

Use -o for logical OR. To print all files with the executable bit or with a .exe extension:

$ find . -type f \( -executable -o -name '*.exe' \)

I needed parentheses because -o has lower precedence than -a, and because parentheses are shell metacharacters I also needed to escape them for the shell. It’s a shame find didn’t use [ and ] instead! There’s also a unary logical NOT operator, !. To print all non-executable files:

$ find . -type f ! -executable

Binary operators are short-circuiting, so this:

$ find -type d -a -exec du -sh {} +

Only lists the sizes of directories, as the -type d fails causing the whole expression to evaluate to false without evaluating -exec. Or equivalently with -o:

$ find ! -type d -o -exec du -sh {} +

If it’s not a directory then the left-hand side evaluates to true, and the right-hand side is not evaluated. All three implementations I examined (GNU, BSD, BusyBox) have a -regex extension, and eagerly compile the regular expression even if the operation is never evaluated:

$ find . -print -o -regex [
find: bad regex '[': Invalid regular expression

I was surprised by this because it doesn’t seem to be in the spirit of the original utility (“The second expression shall not be evaluated if the first expression is true.”), and I’m used to the idea of short-circuit validation for the right-hand side of a logical expression. Recompiling for each evaluation would be unwise, but it could happen lazily such that an invalid regular expression only causes an error if it’s actually used. No big deal, just a curiosity.

Bytecode design

A bytecode interpreter needs to track just one result at a time, making it a single register machine, with a 1-bit register at that. I came up with these five opcodes:

halt
not
braf   LABEL
brat   LABEL
action NAME [ARGS...]

Obviously halt stops the program. While I could just let it “run off the end” it’s useful to have an actual instruction so that I can attach a label and jump to it. The not opcode negates the register. braf is “branch if false”, jumping (via relative immediate) to the labeled (in printed form) instruction if the register is false. brat is “branch if true”. Together they implement the -a and -o operators. In practice there are no loops and jumps are always forward: find is not Turing complete.

In a real implementation each possible action (-name, -ok, -print, -type, etc.) would get a dedicated opcode. This requires implementing each operator, at least in part, in order to correctly parse the whole find expression. For now I’m just focused on the bytecode compiler, so this opcode is a stand-in, and it kind of pretends based on looks. Each action sets the register, and actions like -print always set it to true. My compiler is called findc (“find compiler”).

Update: Or try the online demo via Wasm! This version includes a peephole optimizer I wrote after publishing this article.

I assume readers of this program are familiar with push macro and Slice macro. Because of the latter it requires a very recent C compiler, like GCC 15 (e.g. via w64devkit) or Clang 22. Try out some find commands and see how they appear as bytecode. The simplest case is also optimal:

$ findc
// path: .
        action  -print
        halt

Print the path then halt. Simple. Stepping it up:

$ findc -type f -executable
// path: .
        action  -type f
        braf    L1
        action  -executable
L1:     braf    L2
        action  -print
L2:     halt

If the path is not a file, it skips over the rest of the program by way of the second branch instruction. It’s correct, but already we can see room for improvement. This would be better:

        action  -type f
        braf    L1
        action  -executable
        braf    L1
        action  -print
L1:     halt

More complex still:

$ findc -type f \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L1
        action  -name *.exe
L1:     braf    L2
        action  -print
L2:     halt

Inside the parentheses, if -executable succeeds, the right-hand side is skipped. Though the brat jumps straight to a braf. It would be better to jump ahead one more instruction:

        action  -type f
        braf    L2
        action  -executable
        brat    L1
        action  -name *.exe
        braf    L2
L1      action  -print
L2:     halt

Silly things aren’t optimized either:

$ findc ! ! -executable
// path: .
        action  -executable
        not
        not
        braf    L1
        action  -print
L1:     halt

Two not in a row cancel out, and so these instructions could be eliminated. Overall this compiler could benefit from a peephole optimizer, scanning over the program repeatedly, making small improvements until no more can be made:

  • Delete not-not.
  • A brat to a braf re-targets ahead one instruction, and vice versa.
  • Jumping onto an identical jump adopts its target for itself.
  • A not-braf might convert to a brat, and vice versa.
  • Delete side-effect-free instructions before halt (e.g. not-halt).
  • Exploit always-true actions, e.g. -print-braf can drop the branch.

Writing a bunch of peephole pattern matchers sounds kind of fun. Though my compiler would first need a slightly richer representation in order to detect and fix up changes to branches. One more for the road:

$ findc -type f ! \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L2
        action  -name *.exe
L2:     not
L1:     braf    L3
        action  -print
L3:     halt

The unoptimal jumps hint at my compiler’s structure. If you’re feeling up for a challenge, pause here to consider how you’d build this compiler, and how it might produce these particular artifacts.

Parsing and compiling

Before I even considered the shape of the bytecode I knew I needed to convert find infix into a compiler-friendly postfix. That is, this:

-type f -a ! ( -executable -o -name *.exe )

Becomes:

-type f -executable -name *.exe -o ! -a

Which, importantly, erases the parentheses. This comes in as an argv array, so it’s already tokenized for us by the shell or runtime. The classic shunting-yard algorithm solves this problem easily enough. We have an output queue that goes into the compiler, and a token stack for tracking -a, -o, !, and (. Then we walk argv in order:

  • Actions go straight into the output queue.

  • If we see one of the special stack tokens we push it onto the stack, first popping operators with greater precedence into the queue, stopping at (.

  • If we see ) we pop the stack into the output queue until we see (.

When we’re out of tokens, pop the remaining stack into the queue. My parser synthesizes -a where it’s implied, so the compiler always sees logical AND. If the expression contains no -exec, -ok, or -print, after processing is complete the parser puts -print then -a into the queue, which effectively wraps the whole expression in ( expr ) -print. By clearing the stack first, the real expression is effectively wrapped in parentheses, so no parenthesis tokens need to be synthesized.

I’ve used the shunting-yard algorithm many times before, so this part was easy. The new part was coming up with an algorithm to convert a series of postfix tokens into bytecode. My solution is the compiler maintains a stack of bytecode fragments. That is, each stack element is a sequence of one or more bytecode instructions. Branches use relative addresses, so they’re position-independent, and I can concatenate code fragments without any branch fix-ups. It takes the following actions from queue tokens:

  • For an action token, create an action instruction, and push it onto the fragment stack as a new fragment.

  • For a ! token, pop the top fragment, append a not instruction, and push it back onto the stack.

  • For a -a token, pop the top two fragments, join then with a braf in the middle which jumps just beyond the second fragment. That is, if the first fragment evaluates to false, skip over the second fragment into whatever follows.

  • For a -o token, just like -a but use brat. If the first fragment is true, we skip over the second fragment.

If the expression is valid, at the end of this process the stack contains exactly one fragment. Append a halt instruction to this fragment, and that’s our program! If the final fragment contained a branch just beyond its end, this halt is that branch target. A few peephole optimizations and could probably be an optimal program for this instruction set.

联系我们 contact @ memedata.com