The first stage in a compiler is the lexical analyser (or lexer, or scanner). It converts a stream of input characters (in a compiler, the entire source program) into a stream of tokens, which are passed to the parser. Each token has a type (its category, such as identifier or integer literal) and usually a value, the lexeme: the sequence of source characters from which it was matched.

Generally, lexical analysers:

  • Delete whitespace and comments.
  • Identify lexical tokens in the source stream.
  • Transmit these tokens to the parser.
  • Pass token metadata (file name, line number, column number) along with each token (a token record carrying this metadata is sketched after this list).
  • Generate errors for (and possibly recover from) ill-formed tokens.

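A minimal sketch of such a token record, in Python, with illustrative (assumed) token type names:

```python
from dataclasses import dataclass

@dataclass
class Token:
    type: str      # token category, e.g. "IDENT", "INT_LIT", "PLUS"
    lexeme: str    # the exact source characters that were matched
    line: int      # positional metadata, used for error reporting
    column: int

# For the input "count + 1", a lexer might emit:
#   Token("IDENT",   "count", 1, 1)
#   Token("PLUS",    "+",     1, 7)
#   Token("INT_LIT", "1",     1, 9)
```
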
To write a lexer by hand, we follow the steps below. Note that many lexer generators (ANTLR4, flex, etc.) automate steps 3-7. A sketch of steps 6 and 7 follows the list.

  1. Determine what tokens should be in the target language.
  2. Express these tokens as regular expressions.
  3. Convert the regexes to a non-deterministic finite automaton (NFA).
  4. Convert the NFA into a deterministic finite automaton (DFA).
  5. Minimise the DFA (optional).
  6. Encode the DFA in a transition table.
  7. Write code to scan the input and use the table to identify tokens and report errors.
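
As a rough illustration of steps 6 and 7, the sketch below hard-codes a transition table for a tiny DFA recognising identifiers ([a-zA-Z][a-zA-Z0-9]*) and integer literals ([0-9]+), then scans the input with maximal munch (always taking the longest accepting match). The state names, input classes, and error handling are assumptions made for this example, not a fixed scheme.

```python
# Accepting states of the hand-built DFA, mapped to the token type each produces.
ACCEPTING = {"IN_IDENT": "IDENT", "IN_INT": "INT_LIT"}

def char_class(c):
    """Collapse characters into the input classes the table is indexed by."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

# Step 6: the transition table, (state, input class) -> next state.
TABLE = {
    ("START", "letter"):    "IN_IDENT",
    ("START", "digit"):     "IN_INT",
    ("IN_IDENT", "letter"): "IN_IDENT",
    ("IN_IDENT", "digit"):  "IN_IDENT",
    ("IN_INT", "digit"):    "IN_INT",
}

def scan(source):
    """Step 7: walk the table, emitting the longest match at each position."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos] in " \t\n":      # delete whitespace
            pos += 1
            continue
        state, best = "START", None     # best = (token type, end index) of longest match so far
        i = pos
        while i < len(source):
            state = TABLE.get((state, char_class(source[i])))
            if state is None:           # no transition: stop extending this token
                break
            i += 1
            if state in ACCEPTING:
                best = (ACCEPTING[state], i)
        if best is None:                # ill-formed token: generate an error
            raise SyntaxError(f"unexpected character {source[pos]!r} at index {pos}")
        token_type, end = best
        tokens.append((token_type, source[pos:end]))
        pos = end
    return tokens

print(scan("count 42 x9"))
# [('IDENT', 'count'), ('INT_LIT', '42'), ('IDENT', 'x9')]
```

A generated lexer works the same way in principle; the generator simply derives the table from the regexes (steps 3-5) instead of having it written by hand.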