I haven’t shipped any new features for Quamina in many months, partly due to a flow of real-life distractions, but also I’m up against tough performance problems in implementing Regular Expressions at massive scale. I’m still looking for a breakthrough, but have learned things about building and executing finite automata that I think are worth sharing. This piece has to do with epsilons; anyone who has studied finite automata will know about them already, but I’ll offer background for those people to skip. I’ve written about this before in Epsilon Love. A commenter pointed out that the definition of “epsilon” in that piece is not quite right per standard finite-automata theory, but it’s still a useful in that it describes how epsilons support constructs like the shell-style “*”. Background · Finite automata come in two flavors: Deterministic (DFA) and Nondeterministic (NFA). DFAs move from state to state one input symbol at a time: it’s simple and easy to understand and to implement. NFAs have two distinguishing characteristics: First, when you’re in a state and an input symbol arrives, you can transfer to more than one other state. Second, a state can have “epsilon transitions” (let’s say “ε” for epsilon), which can happen any time at all while you’re in that state, input or no input. NFAs are more complicated to traverse (will discuss below) but you need them if you want to implement regular expressions with . and ? and * and so on. You can turn any NFA into a DFA, and I’ll come back to that subject in a future piece. For implementing NFAs, I’ve been using Thompson's construction, where “Thompson” is Ken Thompson, co-parent of Unix. This technique is also nicely described by Russ Cox in Regular Expression Matching Can Be Simple And Fast. You don’t need to learn it to understand this piece, but I’ll justify design choices by saying “per Thompson”. I’m going to discuss two specific issues today, ε-closures and a simpler NFA definition. ε-closures · To set the stage,...
First seen: 2025-07-09 21:36
Last seen: 2025-07-10 00:37