Thompson's Construction for Regular Expressions
Enter a regular expression below to visualize its corresponding non-deterministic finite automaton. The automaton is constructed using Thompson's algorithm, which produces an NFA with exactly one start state and one accept state.
Enter a regular expression and click "Construct" to visualize the automaton.
Regular expressions and finite automata are two of the most fundamental concepts in theoretical computer science. Though they appear quite different—one a concise textual notation, the other a graphical state machine—they possess equivalent computational power. Any pattern that can be described by a regular expression can also be recognized by a finite automaton, and vice versa.
This equivalence is not merely theoretical; it underlies the implementation of regular expression engines in programming languages, text editors, and compilers. When you use a regex in Python or JavaScript, the engine often converts your pattern into an automaton behind the scenes to perform efficient matching.
A language is regular if and only if it can be recognized by a finite automaton. Equivalently, a language is regular if and only if it can be described by a regular expression. The class of regular languages is closed under union, concatenation, and Kleene star.
A regular expression (often abbreviated regex or regexp) is a sequence of characters that defines a search pattern. Regular expressions were first formalized by mathematician Stephen Cole Kleene in 1951 as a notation for describing regular languages—the simplest class of formal languages in the Chomsky hierarchy.
A regular expression over an alphabet Σ is defined inductively:
| Operator | Name | Description | Example |
|---|---|---|---|
| | | Alternation (Union) | Matches either the left or right operand | a|b matches "a" or "b" |
| · (implicit) | Concatenation | Matches the left operand followed by the right | ab matches "ab" |
| * | Kleene Star | Matches zero or more repetitions | a* matches "", "a", "aa", ... |
| + | Kleene Plus | Matches one or more repetitions | a+ matches "a", "aa", "aaa", ... |
| ? | Optional | Matches zero or one occurrence | a? matches "" or "a" |
Note: The operators + and ? are syntactic sugar: R+ ≡ RR* and R? ≡ (R|ε).
A finite automaton (plural: automata) is an abstract machine that can be in exactly one of a finite number of states at any given time. The automaton reads an input string symbol by symbol, transitioning between states according to a transition function. If, after reading the entire input, the machine is in an accepting state, the string is said to be accepted by the automaton.
An NFA is a 5-tuple (Q, Σ, δ, q₀, F) where:
There are two main types of finite automata:
Despite their differences, NFAs and DFAs recognize exactly the same class of languages—the regular languages. Any NFA can be converted to an equivalent DFA (via the subset construction), though the DFA may have exponentially more states.
The symbol ε (epsilon) represents the empty string—a string of length zero. An ε-transition allows the automaton to change states without consuming any input symbol. This non-deterministic feature makes NFAs more compact and easier to construct algorithmically, though it requires special handling during simulation (computing the ε-closure of states).
Thompson's construction, developed by Ken Thompson in 1968, is an algorithm for converting a regular expression into an equivalent NFA. The algorithm is syntax-directed: it recursively builds NFAs for subexpressions and combines them according to the regex operators.
The NFA produced by Thompson's construction has several elegant properties:
Thompson's algorithm defines NFA fragments for base cases and shows how to combine them:
While there are other algorithms for regex-to-NFA conversion, Thompson's construction is favored for several reasons: