I haven’t shipped any new features for Quamina in many months, partly due to a flow of real-life distractions, but also I’m up against tough performance problems in implementing Regular Expressions at massive scale. I’m still looking for a breakthrough, but have learned things about building and executing finite automata that I think are worth sharing. This piece has to do with epsilons; anyone who has studied finite automata will know about them already, but I’ll offer background for those people to skip.
I’ve written about this before in
Epsilon Love. A commenter pointed out that the definition of “epsilon”
in that piece is not quite right per standard finite-automata theory, but it’s still a useful in that it describes how
epsilons support constructs like the shell-style “*
”.
Background · Finite automata come in two flavors: Deterministic (DFA) and Nondeterministic (NFA). DFAs move from state to state one input symbol at a time: it’s simple and easy to understand and to implement. NFAs have two distinguishing characteristics: First, when you’re in a state and an input symbol arrives, you can transfer to more than one other state. Second, a state can have “epsilon transitions” (let’s say “ε” for epsilon), which can happen any time at all while you’re in that state, input or no input.
NFAs are more complicated to traverse (will discuss below) but you need them if you want to
implement regular expressions with .
and ?
and *
and so on. You can turn any NFA into a
DFA, and I’ll come back to that subject in a future piece.
For implementing NFAs, I’ve been using Thompson's construction, where “Thompson” is Ken Thompson, co-parent of Unix. This technique is also nicely described by Russ Cox in Regular Expression Matching Can Be Simple And Fast. You don’t need to learn it to understand this piece, but I’ll justify design choices by saying “per Thompson”.
I’m going to discuss two specific issues today, ε-closures and a simpler NFA definition.
ε-closures ·
To set the stage, consider this regexp: A?B?C?X
It should match “X” and “BX” and “ACX” and so on, but not “CAX” or “XX”.
Thompson says that you implement A?
with a
transition to the next state on “A” and another ε-transition to that next state; because if you see an “A” you should transition,
but then you can transition anyhow even if you don’t.
The resulting NFA looks like this:
In finite-automaton
math, states are usually represented by the letter “q” followed by a number (usually italicized and subscripted, like
q0, but not here, sorry). Note q4
’s double circle which means it’s
a goal state, i.e. if we get here we’ve matched the regexp.
I should add that this was produced with
draw.io, which seems to make this sort of thing easy.
Back to that NFA · So, here’s a challenge: Sketch out the traversal code in your head. Think about the input strings “AX” and “BCX” and just “X” and how you’d get through the NFA to the Q4 goal state.
The trick is what’s called the ε-closure. When you get to a state, before you look at the next input symbol, you have to set
up to process it. In this case, you need to be able to transition on an A or B or C. So what you do is pull together the
start state q0
and also any other states you can reach from there through ε-transitions. In this case, the ε-closure
for the start state is {q0, q1, q2, q3}
.
Suppose, then, that you see a “B” input symbol. You apply it to all the states in the ε-closure. Only q1
matches,
transitioning you to
q2
. Before you look at the next input symbol, you compute the ε-closure for q2
, which turns
out to be {q2, q3}
. With this ε-closure, you can match “C” or “X”. If you get a “C”, you”ll step to
q3
, whose ε-closure is just itself, because “X” is the only path forward.
So your NFA-traversal algorithm for one step becomes something like:
Start with a list of states.
Compute the ε-closure of that that list.
Read an input symbol.
For each state in the ε-closure, see if you can traverse to another state.
If so, add it to your output list of states.
When you’re done, your output list of states is the input to this algorithm for the next step.
Computation issues ·
Suppose your regular expression is (A+BC?)+
. I’m not going to sketch out the NFA, but just looking at it tells
you that it has to have loopbacks; once you’ve matched the parenthetized chunk you need to go back to a state where you can
recognize another occurrence.
For this regexp’s NFA, computing the ε-closures can lead you into an infinite loop. (Should be obvious, but I didn’t realize it until
after the first time it happened.)
You can have loops and you can also have dupes. In practice, it’s not that uncommon for a state to have more than one ε-transition, and for the targets of these transitions to overlap.
So you need to watch for loops and to dedupe your output. I think the only way to avoid this is with a cookie-crumbs “where I’ve been” trail, either as a list or a hash table.
Both of these are problematic because they require allocating memory, and that’s something you really don’t want to do when you’re trying to match patterns to events at Quamina’s historic rate of millions per second.
I’ll dig into this problem in a future Quamina-Diary outing, but obviously, caching computed epsilon closures would avoid re-doing this computation.
Anyhow, bear ε-closures in mind, because they’ll keep coming up as this series goes on.
And finally, simplifying “NFA” · At the top of this piece, I offered the standard definition of NFAs: First, when you’re in a state and an input symbol arrives, you can transfer to more than one other state. Second, you can have ε-transitions. Based on my recent work, I think this definition is redundant. Because if you need to transfer to two different states on some input symbol, you can do that with ε-transitions.
Here’s a mini-NFA that transfers from state q0
on “A” to both q1
and q2
.
And here’s how you can achieve the same effect with ε-transitions:
In that NFA, in qS
the “S” stands for “splice”, because it’s a state that exists to connect
two threads of finite-automaton traversal.
I’m pretty sure that this is more than just a mathematical equivalence. In my regexp implementation, so far at least,
I’ve never encountered a need to do that first kind of dual transition. Furthermore, the “splice” structure is how Thompson
implements the regular-expression “|
” operator.
So if you’re building an NFA, all the traversal stuff you need in a state is a simple map from input symbol to next state, and a list of ε-transitions.
Next up · How my own implementation of NFA traversal collided head-on into the Benchmark From Hell and still hasn’t recovered.