The first stage in a compiler is the lexical analyser (or lexer, or scanner). It converts a stream of input characters (in a compiler, the entire source program) into a stream of tokens, which are passed to the parser. Each token has a type (its syntactic category) and usually carries a value, the lexeme: the substring of the source text that matched it.
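To make the token/lexeme distinction concrete, here is a minimal sketch (the `Token` class and the token-type names are illustrative, not from any particular compiler) of the token stream a lexer might emit for the input `count = count + 1`:

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str    # token type, e.g. IDENT, ASSIGN, INT
    lexeme: str  # the matched source text
    line: int    # position metadata, forwarded for error reporting

# Token stream a lexer might emit for the input "count = count + 1":
tokens = [
    Token("IDENT", "count", 1),
    Token("ASSIGN", "=", 1),
    Token("IDENT", "count", 1),
    Token("PLUS", "+", 1),
    Token("INT", "1", 1),
]
```

Note that two tokens can share a type (`IDENT`) while differing in lexeme, which is exactly the information the parser and later phases need.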

Generally, lexical analysers:

  • Delete whitespace and comments.
  • Identify lexical tokens in the source stream.
  • Transmit these tokens to the parser.
  • Transmit token metadata (file name, line number, column number) forward.
  • Generate errors for (and possibly recover from) ill-formed tokens.
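The responsibilities above can be sketched in a few lines using regex-based matching (a common shortcut before building the automata discussed below). The token names and grammar here are hypothetical, chosen only to show whitespace/comment deletion, line tracking, and error generation in one loop:

```python
import re

# Hypothetical token spec for a tiny language; order matters, since the
# first alternative that matches wins.
TOKEN_SPEC = [
    ("WS",      r"[ \t]+"),      # whitespace: deleted, never transmitted
    ("COMMENT", r"#[^\n]*"),     # comments: deleted as well
    ("NEWLINE", r"\n"),          # consumed only to track line numbers
    ("INT",     r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Yield (kind, lexeme, line) tuples; raise on an ill-formed token."""
    line, pos = 1, 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            # Error generation for input no pattern matches.
            raise SyntaxError(f"line {line}: unexpected character {text[pos]!r}")
        kind = m.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind not in ("WS", "COMMENT"):
            yield (kind, m.group(), line)   # transmit token + metadata
        pos = m.end()
```

For example, `scan("x = 1 # note\ny")` yields four tokens, with the final `IDENT` reported on line 2 and the comment silently dropped.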

To write a lexer by hand, we follow the steps below. Note that lexer generators such as lex/flex and ANTLR automate steps 3–7, taking the regular expressions from step 2 as input.

  1. Determine what tokens should be in the target language.
  2. Express these tokens as regular expressions.
  3. Convert the regexes to a non-deterministic finite automaton (NFA), e.g. via Thompson's construction.
  4. Convert the NFA into a deterministic finite automaton (DFA), e.g. via the subset construction.
  5. Minimise the DFA (optional).
  6. Encode the DFA in a transition table.
  7. Write code that scans the input and uses the table to identify tokens and report errors.
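Steps 6 and 7 can be sketched as follows. This is a hand-encoded transition table for a single token class, integer literals (in effect the DFA for `\d+`), together with a driver that runs the DFA to find the longest match; the state names and `char_class` function are illustrative, and a real generator would emit tables covering every token in the language:

```python
# Step 6: encode the DFA in a transition table.
START, IN_INT = 0, 1                 # DFA states (illustrative)
ACCEPTING = {IN_INT: "INT"}          # accepting state -> token type

def char_class(c):
    """Group characters into the classes the table distinguishes."""
    return "digit" if c.isdigit() else "other"

TRANSITIONS = {
    (START,  "digit"): IN_INT,
    (IN_INT, "digit"): IN_INT,
    # Missing entries mean "no transition": the DFA gets stuck.
}

# Step 7: drive the DFA over the input.
def longest_match(text, pos):
    """Run the DFA from `pos`; return (token_type, end_index) for the
    longest accepted prefix, or None if no token matches here."""
    state, last_accept, i = START, None, pos
    while i < len(text):
        nxt = TRANSITIONS.get((state, char_class(text[i])))
        if nxt is None:
            break                    # stuck: stop and report best match so far
        state, i = nxt, i + 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    return last_accept
```

Remembering the last accepting position (rather than stopping at the first) implements the usual maximal-munch rule, so `123` is scanned as one `INT` token rather than three.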