eceIn computer science, a regular expression (regex) is a method to define a formal language. Practically speaking, regex is used in software engineering to define a description of a pattern of text to search for. Formally, we use a recursive definition to define the following regular expressions over an alphabet :1

  • Base: (empty set), (empty string), and each symbol individually in are regular expressions over .
  • Recursion: if and are regular expressions over , then , , and (i.e., concatenation, alternation, or the Kleene closure) are also regular expressions over .
    • The “recursion” here comes from the idea that any more “complicated” regex can be recursively broken down into base operators one-by-one, observing their precedence.
    • Then, we can define the Kleene closure given a regex with: .
  • Restriction: nothing except for the above are regular expressions over .

Given a finite alphabet, the set of sentences obtained by a regular expression defines a formal language that is generated/accepted by , denoted .

Every regular expression over the alphabet defines a formal language.

The basics

Remember our basic formal language operators. In order of precedence:

  • Set of characters: denoted with square brackets (i.e., [a, z])
  • Grouping/scoping: denoted with parentheses.
  • Repetition:
    • Kleene closure: *, which denotes zero or more instances of the substring.
    • +, which denotes at least one or more instances. b+ == bb*
  • Concatenation: substrings next to each other.
  • Alternation (logical OR): | vertical pipe.

Finite automata

Finite state automata are able to recognise sentences specified by a regex. By Kleene’s theorem, for any formal language that is accepted by an FSM, there is a regular expression that defines the same language. The inverse case is true (for any regex, there is an FSM), provable with DFAs. All the fundamental regex operators can be converted to an NFA. For the following, any connecting intermediate accepting states become regular states. For the concatenation operator:

https://www.autoregex.xyz/

The basics

\d refers to a decimal number. We can group within our expression with parentheses.

In programming

In Python, we use the re module. Any regex expressions should be wrapped in apostrophes and beginning with an r: r'\d\d\d' will search for three numbers. Some applicable methods:

  • compile(r'expression) will compile a regex object to a variable.
  • a = var.search('string') will search for the string within the regex object var.
  • a.group() will output whatever’s been found.
    • If our regex expression groups with parentheses, we can use indices.
    • groups() will output into a tuple.

Footnotes

  1. From Discrete Mathematics with Applications, by Susanna Epp.