The first stage in a compiler is the lexical analyser (also called a lexer or scanner). It converts a stream of input characters (in a compiler, the entire source program) into a stream of tokens, which are passed to the parser. Each token has a type (identifier, number, operator, etc.) and an associated lexeme: the substring of the source that the token matched.
Generally, lexical analysers:
- Delete whitespace and comments.
- Identify lexical tokens in the source stream.
- Transmit these tokens to the parser.
- Transmit token metadata (file name, line number, column number) forward.
- Generate errors for (and possibly recover from) ill-formed tokens.
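The responsibilities above can be sketched with a small regex-driven tokenizer. This is a minimal illustration, not a production lexer; the token names and the tiny language are hypothetical. It deletes whitespace and comments, tracks line and column metadata, yields tokens to its caller, and reports ill-formed characters:

```python
import re

# Hypothetical token specification for a tiny language: each token type
# is paired with a regular expression. Order matters: earlier patterns win.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("ASSIGN",  r"="),
    ("PLUS",    r"\+"),
    ("COMMENT", r"#[^\n]*"),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),
    ("ERROR",   r"."),          # any other character is ill-formed
]

def tokenize(text):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    line, line_start = 1, 0
    for m in re.finditer(pattern, text):
        kind, lexeme = m.lastgroup, m.group()
        column = m.start() - line_start + 1
        if kind == "NEWLINE":
            line, line_start = line + 1, m.end()
        elif kind in ("SKIP", "COMMENT"):
            continue                      # whitespace and comments are deleted
        elif kind == "ERROR":
            print(f"lexical error at {line}:{column}: unexpected {lexeme!r}")
        else:
            yield (kind, lexeme, line, column)

print(list(tokenize("x = x + 1  # increment")))
# → [('IDENT', 'x', 1, 1), ('ASSIGN', '=', 1, 3), ('IDENT', 'x', 1, 5),
#    ('PLUS', '+', 1, 7), ('NUMBER', '1', 1, 9)]
```

A real lexer would also carry the file name in the metadata and hand errors to a shared diagnostics facility rather than printing them directly.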
To write a lexer by hand, we follow the steps below. Note that many lexer generators (e.g. lex/flex, ANTLR) automate steps 3-7.
1. Determine what tokens should be in the target language.
2. Express these tokens as regular expressions.
3. Convert the regular expressions to a non-deterministic finite automaton (NFA).
4. Convert the NFA into a deterministic finite automaton (DFA).
5. Minimise the DFA (optional).
6. Encode the DFA in a transition table.
7. Write code that scans the input and uses the table to identify tokens and report errors.
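The last two steps can be sketched as follows. This is a hand-built illustration, assuming a hypothetical DFA that recognises just identifiers and integer literals; the state names and character classes are invented for the example. The driver applies the usual maximal-munch rule, remembering the last accepting state so the longest possible token is emitted:

```python
# Character classes collapse the input alphabet so the table stays small.
def char_class(c):
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

# Transition table: state -> {character class -> next state}.
# A missing entry means the automaton has no move (the token ends there).
TABLE = {
    "start":  {"letter": "ident", "digit": "number"},
    "ident":  {"letter": "ident", "digit": "ident"},
    "number": {"digit": "number"},
}
ACCEPT = {"ident": "IDENT", "number": "NUMBER"}   # accepting states -> token type

def scan(text):
    tokens, i = [], 0
    while i < len(text):
        if text[i] in " \t\n":            # whitespace handled outside the automaton
            i += 1
            continue
        state, j, last_accept = "start", i, None
        while j < len(text):
            nxt = TABLE.get(state, {}).get(char_class(text[j]))
            if nxt is None:
                break
            state, j = nxt, j + 1
            if state in ACCEPT:
                last_accept = (ACCEPT[state], j)  # maximal munch: remember longest match
        if last_accept is None:
            raise ValueError(f"lexical error at index {i}: {text[i]!r}")
        kind, end = last_accept
        tokens.append((kind, text[i:end]))
        i = end
    return tokens

print(scan("x1 42 foo"))
# → [('IDENT', 'x1'), ('NUMBER', '42'), ('IDENT', 'foo')]
```

A generated lexer encodes the same idea more compactly, typically indexing a dense 2-D array by (state, character class) instead of nested dictionaries.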