Non-Deterministic Finite Automata in Regular Expression Engine

Regular Expression Engine

Regular expression engine can be divided into 2 different categories: DFA (Deterministic finite automaton) and NFA( Non-deterministic finite automaton). The NFA can be divided into Traditional NFA and POSIX NFA.

Backtracking is allowed in DFA, while in NFA, it may or may not be allowed, so NFA is faster than DFA. NFA doesn't support capture groups.

Language and tools that use the DFA: Awk, egrep and Lex.

POSIX NFA

POSIX or "Portable Operating System Interface for UNIX" is a collection of standards that define some of the functionality that a (UNIX) operating system should support. One of these standards defines two flavors of regular expressions. Commands involving regular expressions, such as grep and egrep, implement these flavors on POSIX-compliant UNIX systems. Several database systems also use POSIX regular expressions.

POSIX NFA support longest-leftmost match, the DFA finds the longest possible match, period. Since it's the longest from among all possible matches that start equally furthest to the left, it's the "longest-leftmost" match.

Most languages ​and tools use the traditional NFA engine, which has some features not supported by DFA:

  • Capture groups and reverse
  • Lookaround (pre-search): (?<=...), (?<!...), (?=...), (?!...)
  • The non-greedy (Ignore optional quantifiers): ??, *?, +?, {m,n}?, {m,}?

The composition of the string

For the string abc, includes three characters and 4 positions ([position 0]a[position 1]b[position 2]c[position 3]).

Character possession and zero-width

In the regular expression matching process, if the sub-expression matches the content of the characters, rather than position, and is saved to the final match results, this sub-expression is character possession. If the sub-expression match the position, or the matched content isn't saved in the final match result, that sub-expression is zero-width.

Character possession is exclusive, zero-width is non-exclusive. In other words, a character that can only be matched by a sub-expression at the same time, a position can be matched by multiple zero-width sub-expressions at the same time.

Control and drive

In the regular matching process, a sub-expression (which may be a normal character, meta-characters or meta-characters sequence) get the control permission, try to match from a position in the string. The beginning position of a sub-expression try to match is the previous sub-expression success match end position. Such as regular expression:

(sub-expression 1)(sub-expression 2)

If the (sub-expression 1) is a zero-width, since it matches beginning and end at the same position, for example position 0, so the (sub-expression 2) trying to match from position 0

If the (sub-expression 1) is character possession expression, since it matches the beginning and end at different position, for example the success of the match starts at position 0, at the end position 2, in this case the (sub-expression 2) should be starting match from position 2.

For the entire expression, it's usually trying to match from position 0. If the entire expression match fails in some position, the regular engine will drive forwards, the entire expression will retry matching from position 1, and so on until matching successful or failed at the last position is tried.

Basic matching

Source string: abc
Regular expression: abc

Matching process:

The character a get control permission, begin match from position 0, a matched a success, the character b get control permission and match from position 1, b matched b, the character c get control permission and matched c.

The regular expression matching is complete, matches successfully. The matching result is abc, the starting position is 0, and the ending position is 3.

Match priority quantifiers

Source string: abc
Regular expression: ab?c

The quantifier ? is match priority quantifier, it will first try to match if can be match and can't match. The quantifier ? is used to modify the character b, so b? is together.

Matching process:

The character a get control permission, begin match from position 0, a matched a, match success, the character b? get control permission, ? is match priority quantifier it will try to match, b? matched b success, the character c get control permission and record an alternate status at the same time, c matched c success, discard alternative status.

The regular expression matching is complete, matches successfully. The matching result is abc, the starting position is 0, and the ending position is 3.

Source string: ac
Regular expression: ab?c

Matching process:

The character a get control permission, begin match from position 0, a matched a, match success, the character b? get control permission will try to match, b? matched c failed and record an alternate status, backtracking and find alternative status, b? match end, give up control permission, the character c get control permission and matched c.

Zero-width matching

Source string: a12
Regular expression: ^(?=[a-z])[a-z0-9]+$

Meta-characters ^ and $ only match the position. Order lookaround (?=[a-z]) just match and not take character possession, and also not save the matching contents to the results, so it's zero-width.

Matching process:

The meta-characters ^ get control permission, begin match from position 0, match success.

Order lookaround​ (?=[a-z]) get control permission, it requires must have a letter on the right of its location, since zero-width sub-expressions are not exclusive (the same position can be matched by multiple zero-width sub-expressions at the same time), so it will begin match from position 0. The right side of the position 0 is the character a, match success.

Order lookaround (?=[a-z]) only match the position and not save the matching contents to the results, match result is position 0, [a-z0-9]+ get control permission, and also begin match from position 0, it will try to match a, match success. Then successfully match the next 1 and 2, which now matches to position 3, and there is no character on the right of position 3.

The meta-characters $ get control permission, begin match from position 3, match success.

The regular expression matching is complete, matches successfully. The matching result is a12, the starting position is 0, and the ending position is 3. ^ and (?=[a-z]) matched the position 0, [a-z0-9]+ matched the string a12, $ matched the position 3.

Non-Deterministic Finite Automata in Regular Expression Engine
3 votes, 4.33 avg. rating (88% score)