QRS:Epsilon争吵
QRS: Epsilon Wrangling

原始链接: https://www.tbray.org/ongoing/When/202x/2025/07/07/Epsilon-Wrangling

Quamina在大规模实施正则表达式方面面临绩效挑战。核心问题围绕着有效地穿越非确定性有限自动机(NFA),尤其是ε-锁骨的概念。 ε-斜率涉及通过εTransitions识别从给定状态达到的所有状态(不消耗输入的过渡)。在处理任何输入符号之前,计算这些封闭是必不可少的,但是它可以导致无限的循环和重复的状态,因此需要记忆密集型解决方案,例如缓存。 作者提出了一个简化的NFA定义,通过证明传统的“双重过渡”(可以在单个输入上移动到多个状态)可以使用εTransitions和“ splice”状态复制。这将NFA状态的内部表示形式简化为下一个状态的输入符号的地图和ε-转移列表。 下一步将探讨作者的NFA实施方式如何与艰难的基准和所学的经验教训相撞。

黑客新闻讨论围绕正则实施和优化。用户O11C分享了他们构建简​​化的正则发动机的经验,该发动机避免了NFA,选择DFA方法,找到诸如`(AB)+|(AC)+'具有挑战性的复杂模式。他们强调了Python的性能挑战,特别是用基本的编程结构进行的延迟。 克拉根(Kragen)建议采用替代方法,包括简洁的钉子解析器和罗布·派克(Rob Pike)的模式匹配器,强调了时间和空间效率之间的权衡。它们链接到代码示例。 O11C回答要反对为简单性而牺牲性能和空间效率的方法,并提倡达到平衡的代码。 克拉根(Kragen)通过说明空间和时间之间的不可避免的权衡来反驳,并表达了对快速和空间有效的PEG解析潜力的信念,尽管缺乏例子。
相关文章

原文

I haven’t shipped any new features for Quamina in many months, partly due to a flow of real-life distractions, but also I’m up against tough performance problems in implementing Regular Expressions at massive scale. I’m still looking for a breakthrough, but have learned things about building and executing finite automata that I think are worth sharing. This piece has to do with epsilons; anyone who has studied finite automata will know about them already, but I’ll offer background for those people to skip.

I’ve written about this before in Epsilon Love. A commenter pointed out that the definition of “epsilon” in that piece is not quite right per standard finite-automata theory, but it’s still a useful in that it describes how epsilons support constructs like the shell-style “*”.

Background · Finite automata come in two flavors: Deterministic (DFA) and Nondeterministic (NFA). DFAs move from state to state one input symbol at a time: it’s simple and easy to understand and to implement. NFAs have two distinguishing characteristics: First, when you’re in a state and an input symbol arrives, you can transfer to more than one other state. Second, a state can have “epsilon transitions” (let’s say “ε” for epsilon), which can happen any time at all while you’re in that state, input or no input.

NFAs are more complicated to traverse (will discuss below) but you need them if you want to implement regular expressions with . and ? and * and so on. You can turn any NFA into a DFA, and I’ll come back to that subject in a future piece.

For implementing NFAs, I’ve been using Thompson's construction, where “Thompson” is Ken Thompson, co-parent of Unix. This technique is also nicely described by Russ Cox in Regular Expression Matching Can Be Simple And Fast. You don’t need to learn it to understand this piece, but I’ll justify design choices by saying “per Thompson”.

I’m going to discuss two specific issues today, ε-closures and a simpler NFA definition.

ε-closures · To set the stage, consider this regexp: A?B?C?X

It should match “X” and “BX” and “ACX” and so on, but not “CAX” or “XX”. Thompson says that you implement A? with a transition to the next state on “A” and another ε-transition to that next state; because if you see an “A” you should transition, but then you can transition anyhow even if you don’t.

The resulting NFA looks like this:

In finite-automaton math, states are usually represented by the letter “q” followed by a number (usually italicized and subscripted, like q0, but not here, sorry). Note q4’s double circle which means it’s a goal state, i.e. if we get here we’ve matched the regexp. I should add that this was produced with draw.io, which seems to make this sort of thing easy.

Back to that NFA · So, here’s a challenge: Sketch out the traversal code in your head. Think about the input strings “AX” and “BCX” and just “X” and how you’d get through the NFA to the Q4 goal state.

The trick is what’s called the ε-closure. When you get to a state, before you look at the next input symbol, you have to set up to process it. In this case, you need to be able to transition on an A or B or C. So what you do is pull together the start state q0 and also any other states you can reach from there through ε-transitions. In this case, the ε-closure for the start state is {q0, q1, q2, q3}.

Suppose, then, that you see a “B” input symbol. You apply it to all the states in the ε-closure. Only q1 matches, transitioning you to q2. Before you look at the next input symbol, you compute the ε-closure for q2, which turns out to be {q2, q3}. With this ε-closure, you can match “C” or “X”. If you get a “C”, you”ll step to q3, whose ε-closure is just itself, because “X” is the only path forward.

So your NFA-traversal algorithm for one step becomes something like:

  1. Start with a list of states.

  2. Compute the ε-closure of that that list.

  3. Read an input symbol.

  4. For each state in the ε-closure, see if you can traverse to another state.

  5. If so, add it to your output list of states.

  6. When you’re done, your output list of states is the input to this algorithm for the next step.

Computation issues · Suppose your regular expression is (A+BC?)+. I’m not going to sketch out the NFA, but just looking at it tells you that it has to have loopbacks; once you’ve matched the parenthetized chunk you need to go back to a state where you can recognize another occurrence. For this regexp’s NFA, computing the ε-closures can lead you into an infinite loop. (Should be obvious, but I didn’t realize it until after the first time it happened.)

You can have loops and you can also have dupes. In practice, it’s not that uncommon for a state to have more than one ε-transition, and for the targets of these transitions to overlap.

So you need to watch for loops and to dedupe your output. I think the only way to avoid this is with a cookie-crumbs “where I’ve been” trail, either as a list or a hash table.

Both of these are problematic because they require allocating memory, and that’s something you really don’t want to do when you’re trying to match patterns to events at Quamina’s historic rate of millions per second.

I’ll dig into this problem in a future Quamina-Diary outing, but obviously, caching computed epsilon closures would avoid re-doing this computation.

Anyhow, bear ε-closures in mind, because they’ll keep coming up as this series goes on.

And finally, simplifying “NFA” · At the top of this piece, I offered the standard definition of NFAs: First, when you’re in a state and an input symbol arrives, you can transfer to more than one other state. Second, you can have ε-transitions. Based on my recent work, I think this definition is redundant. Because if you need to transfer to two different states on some input symbol, you can do that with ε-transitions.

Here’s a mini-NFA that transfers from state q0 on “A” to both q1 and q2.

And here’s how you can achieve the same effect with ε-transitions:

In that NFA, in qS the “S” stands for “splice”, because it’s a state that exists to connect two threads of finite-automaton traversal.

I’m pretty sure that this is more than just a mathematical equivalence. In my regexp implementation, so far at least, I’ve never encountered a need to do that first kind of dual transition. Furthermore, the “splice” structure is how Thompson implements the regular-expression “|” operator.

So if you’re building an NFA, all the traversal stuff you need in a state is a simple map from input symbol to next state, and a list of ε-transitions.

Next up · How my own implementation of NFA traversal collided head-on into the Benchmark From Hell and still hasn’t recovered.


联系我们 contact @ memedata.com