Finite Automaton Visualization

Thompson's Construction for Regular Expressions

§1Interactive Visualizer

Enter a regular expression below to visualize its corresponding non-deterministic finite automaton. The automaton is constructed using Thompson's algorithm, which produces an NFA with exactly one start state and one accept state.

Examples:
Figure 1. Non-deterministic Finite Automaton (NFA)

Enter a regular expression and click "Construct" to visualize the automaton.

Supported Syntax

a-z, A-Z, 0-9 — Literal characters
. — Any single character
| — Alternation (union)
* — Kleene star (zero or more)
+ — Kleene plus (one or more)
? — Optional (zero or one)
( ) — Grouping
\ — Escape special characters

§2Introduction

Regular expressions and finite automata are two of the most fundamental concepts in theoretical computer science. Though they appear quite different—one a concise textual notation, the other a graphical state machine—they possess equivalent computational power. Any pattern that can be described by a regular expression can also be recognized by a finite automaton, and vice versa.

This equivalence is not merely theoretical; it underlies the implementation of regular expression engines in programming languages, text editors, and compilers. When you use a regex in Python or JavaScript, the engine often converts your pattern into an automaton behind the scenes to perform efficient matching.

Kleene's Theorem (1956)

A language is regular if and only if it can be recognized by a finite automaton. Equivalently, a language is regular if and only if it can be described by a regular expression. The class of regular languages is closed under union, concatenation, and Kleene star.

§3Regular Expressions

A regular expression (often abbreviated regex or regexp) is a sequence of characters that defines a search pattern. Regular expressions were first formalized by mathematician Stephen Cole Kleene in 1951 as a notation for describing regular languages—the simplest class of formal languages in the Chomsky hierarchy.

Definition: Regular Expression

A regular expression over an alphabet Σ is defined inductively:

  • is a regex denoting the empty language
  • ε is a regex denoting the language {ε} containing only the empty string
  • For each a ∈ Σ, the symbol a is a regex denoting {a}
  • If R and S are regexes, then (R|S), (RS), and (R*) are regexes

Operators and Their Meanings

Operator Name Description Example
| Alternation (Union) Matches either the left or right operand a|b matches "a" or "b"
· (implicit) Concatenation Matches the left operand followed by the right ab matches "ab"
* Kleene Star Matches zero or more repetitions a* matches "", "a", "aa", ...
+ Kleene Plus Matches one or more repetitions a+ matches "a", "aa", "aaa", ...
? Optional Matches zero or one occurrence a? matches "" or "a"

Note: The operators + and ? are syntactic sugar: R+ ≡ RR* and R? ≡ (R|ε).

§4Finite Automata

A finite automaton (plural: automata) is an abstract machine that can be in exactly one of a finite number of states at any given time. The automaton reads an input string symbol by symbol, transitioning between states according to a transition function. If, after reading the entire input, the machine is in an accepting state, the string is said to be accepted by the automaton.

Definition: Non-deterministic Finite Automaton (NFA)

An NFA is a 5-tuple (Q, Σ, δ, q₀, F) where:

  • Q is a finite set of states
  • Σ is a finite alphabet of input symbols
  • δ: Q × (Σ ∪ {ε}) → P(Q) is the transition function
  • q₀ ∈ Q is the initial (start) state
  • F ⊆ Q is the set of accepting (final) states

NFA vs. DFA

There are two main types of finite automata:

Despite their differences, NFAs and DFAs recognize exactly the same class of languages—the regular languages. Any NFA can be converted to an equivalent DFA (via the subset construction), though the DFA may have exponentially more states.

The Epsilon Transition

The symbol ε (epsilon) represents the empty string—a string of length zero. An ε-transition allows the automaton to change states without consuming any input symbol. This non-deterministic feature makes NFAs more compact and easier to construct algorithmically, though it requires special handling during simulation (computing the ε-closure of states).

§5Thompson's Construction

Thompson's construction, developed by Ken Thompson in 1968, is an algorithm for converting a regular expression into an equivalent NFA. The algorithm is syntax-directed: it recursively builds NFAs for subexpressions and combines them according to the regex operators.

Properties of Thompson's NFAs

The NFA produced by Thompson's construction has several elegant properties:

  • Exactly one start state with no incoming transitions
  • Exactly one accepting state with no outgoing transitions
  • At most 2n states for a regex of length n
  • At most 4n transitions
  • Each state has at most two outgoing transitions

Construction Rules

Thompson's algorithm defines NFA fragments for base cases and shows how to combine them:

  1. Symbol a: Create two states connected by a transition labeled a. The first state is the start; the second is the accept state.
  2. Concatenation RS: Connect the accept state of NFA(R) to the start state of NFA(S) by merging them. The start state of NFA(R) becomes the new start; the accept state of NFA(S) becomes the new accept.
  3. Alternation R|S: Create a new start state with ε-transitions to the start states of both NFA(R) and NFA(S). Create a new accept state with ε-transitions from the accept states of both NFAs.
  4. Kleene Star R*: Create new start and accept states. Add ε-transitions: (1) from new start to NFA(R)'s start, (2) from new start to new accept, (3) from NFA(R)'s accept back to NFA(R)'s start, and (4) from NFA(R)'s accept to new accept.

Why Thompson's Construction?

While there are other algorithms for regex-to-NFA conversion, Thompson's construction is favored for several reasons:

§6Glossary of Terms

Alphabet (Σ) A finite, non-empty set of symbols from which strings are formed. For example, {a, b} or {0, 1}.
String A finite sequence of symbols from an alphabet. Also called a word. The length of string w is denoted |w|.
Empty String (ε) The unique string of length zero, containing no symbols. Also written as λ in some texts.
Language A (possibly infinite) set of strings over an alphabet. A language L ⊆ Σ*.
Regular Language A language that can be expressed by a regular expression, or equivalently, recognized by a finite automaton.
State A configuration of the automaton. Represented as a node (circle) in the state diagram.
Initial State The state in which the automaton begins processing input. Denoted by an incoming arrow with no source.
Accepting State A state indicating that the input read so far forms a valid string in the language. Denoted by a double circle.
Transition A directed edge between states, labeled with a symbol. Represents the automaton's response to reading that symbol.
ε-Transition A transition that occurs without consuming any input symbol. Enables non-determinism.
ε-Closure The set of all states reachable from a given state via zero or more ε-transitions.
Non-determinism The property of having multiple possible transitions for a given state and input, or having ε-transitions.
Kleene Star (*) An operation producing zero or more concatenations of a language with itself: L* = {ε} ∪ L ∪ LL ∪ LLL ∪ ...