Course introduces the design of compilers and interpreters. Laboratory work is done in ANTLR with C++.
Concepts covered
- Lexical analysis
- Formal language
- Operators (concatenation, alternation, Kleene closure)
- Regular expression
- Finite state automata
- Deterministic finite automata (DFA), non-deterministic finite automata (NFA)
- Regex to NFA conversions
- NFA to DFA conversions (epsilon-closures)
- Parsing
- Grammar
- Context-free grammar
- Extended Backus-Naur form
- Derivations, grammar ambiguity
- Parse tree
- Parsing table
- Parsing algorithms
- Top-down parsing
- LL(1)
- First, follow sets
- Bottom-up parsing
- Shift-reduce operations
- LR(0), LR(1)
- Parser states
- State closures
- goto function
- Abstract syntax tree
- Construction via actions, visitors
- Symbol table
- Hash table, linked list
- Name arrays
- Type tables
- Semantic analysis
- Data type (type systems)
- Type equivalence
- Type compatibility
- Type inference
- Intermediate representation (IR)
- LLVM intermediate representation
- Compiler optimisation
Extra notes
Parsing lectures
- When the dot sits before a non-terminal, we add items for every production of that non-terminal (the closure)
- For each possible next symbol, we add a new state plus a transition edge
- And move the dot forward
- We don’t add duplicate states - no reason for us to do that. Just point multiple edges towards a given state
- $ == accept state (double rings)
- Remember: `$` (end of input) is neither a terminal nor a non-terminal
- if we have no item with the dot at the end, we get no reduction
- why reduce under every token?
    - in LR(0), irrespective of what the input token is, we reduce
    - based solely on what's on the stack
- conflicts
- dots at the end of two different productions in the same state == a reduce-reduce conflict
7 Oct 2024
- slide 139
- two choices: we could've either done E - E or E -
- but because we reduce first we lose the ability to reduce E - E in LR(0)
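A reconstruction (mine, following the E - E discussion; not copied from the slide) of the conflicted state:

```
Grammar fragment:   E → E - E | id

State after parsing "E - E" contains both items:
    E → E - E ·     dot at the end: reduce
    E → E · - E     dot before '-': shift
```

LR(0) must pick one action for the whole state no matter what the next token is, so this shift-reduce conflict can't be resolved without lookahead.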
- slide 141
- conflicts in state 2
- slide 143
- LR(1): largely extends LR(0)
- conflicts avoidable with a lookahead
- slide 144
- LR(1): more states compared to LR(0), but allows us to distinguish between conflicts
- slide 145
- EBNF colons == right arrow, e.g. `NEWLINE: [\r\n]+;`
- slide 146
- resolves conflicts in favour of the first matching rule, then moves down the file
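For example (two rules of my own, not the slide's): when two lexer rules match the same input, ANTLR keeps the one defined first.

```
IF : 'if' ;        // wins for the input "if" because it appears earlier
ID : [a-z]+ ;      // also matches "if", but is defined later in the file
```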
- AST slides
- Abstract syntax tree: simplifies the parse tree.
- Removes elements that are necessary within the grammar but otherwise not needed in practice; we don't change the program's meaning/operations.
- We also annotate the AST with lots of metadata.
- slide 7
- left/right child pointers
- we use a threaded tree to add pointers back to the parent
- slide 8
- alternatively, we could maintain a vector of child pointers
- we keep threading to parents
- note: compiler references are usually quite old and optimise for memory; these days we don't really care, so we don't have to optimise too much
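A rough C++ sketch of the two layouts (struct names are mine, not the slides'):

```cpp
#include <vector>

// Slide 7 style: fixed left/right child pointers, threaded back to the parent.
struct BinNode {
    BinNode *left = nullptr, *right = nullptr;
    BinNode *parent = nullptr;   // the "thread" up the tree
};

// Slide 8 style: a vector of child pointers, still threaded to the parent.
struct VecNode {
    std::vector<VecNode*> children;
    VecNode *parent = nullptr;
};
```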
- slide 11
- basically embedding C++ code within an ANTLR4 file
- each given line of C++ wrapped in curly brackets is an action
- token attributes: useful values we can embed within C++
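For instance (a made-up rule, not from the slides), the braced C++ below is an action, and `$n.text` reads a token attribute:

```
// assumes <iostream> is pulled in via @header
line : n=INT { std::cout << "saw " << $n.text << std::endl; } NEWLINE ;
```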
10 Oct 2024
- exam logistics
- past midterms: don’t expect proof questions in scope of current midterm
- lab logistics
- marks to come out soon, can check all test cases after submission deadline with exercise command
- two ways to build AST: incrementally during parse tree construction, or after parse tree is built
- ANTLR allows us to build it during parse tree construction
- `@header` prepends code snippets to the beginning of the generated file
- Code in curly brackets associates a snippet with a particular identifier; note this still must be valid {C++, Java, whatever language} syntax
- `$` accesses a token attribute
- `@init`: when we're about to start looking for an `expr`, we execute the snippet
- Non-terminals can have properties, like return values
    - These properties are only associated with a non-terminal instance: if we read `$expr.op` for two different `expr`s, we get different values
- We can also have local variables, which must be defined between the `returns` and the `@init`
- In our C++ source file, ANTLR will create a method with the same name as the start symbol of our grammar that we can invoke
    - when we invoke it, we start the parsing process from that symbol
- We build the AST in ANTLR with these actions (see the sketch after this list)
    - at `@init` we construct a new node; at the end, we return it
    - we also define a local `inode` to make things easier for us
- The alternate way is to use a "visitor" class to build the AST: we traverse the parse tree after it's built and construct the AST as we go
    - ANTLR can automatically generate a visitor class
    - `antlrcpp::Any` is like `void*`, a placeholder that can refer to any type
    - probably need to go through the ANTLR documentation
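Putting those pieces together, a minimal sketch of an action-built rule (assuming hypothetical `ExprNode`/`AddNode` classes and that `term` also returns a node; this is not the exact course grammar):

```
expr returns [ExprNode *node]      // property on this expr instance
     locals  [ExprNode *inode]     // local: defined between returns and @init
     @init   { $node = nullptr; }  // runs as we start looking for an expr
   : l=term '+' r=term
     { $inode = new AddNode($l.node, $r.node);   // build the node in an action
       $node = $inode; }                         // return it via the property
   ;
```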
- Pros and cons?
- Actions clutter the grammar file, we don't get access to C++ tooling, and converting actions from one parser generator to another is tedious
- Now, people like to keep things clean in the grammar file and keep things separate (using separate visitors).
- SmallC
- The `ExprNode` class will never show up in the tree; it exists only to be inherited from
- The `ASTNodes` implementation will be given, either as `.cpp` or `.o`
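A guess at the shape of that hierarchy (only `ExprNode` is named in the notes; the concrete classes below are illustrative):

```cpp
#include <memory>

// Abstract base: never instantiated in the tree, only inherited from.
struct ExprNode {
    virtual ~ExprNode() = default;
};

// Concrete nodes that actually appear in the tree.
struct IntLiteral : ExprNode {
    int value;
    explicit IntLiteral(int v) : value(v) {}
};

struct BinaryOp : ExprNode {
    char op;                                 // '+', '-', ...
    std::unique_ptr<ExprNode> lhs, rhs;
    BinaryOp(char o, std::unique_ptr<ExprNode> l, std::unique_ptr<ExprNode> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
};
```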
17 Oct 2024
- Why a symbol table?
- Subsequent phases need information — may be used for IR generation, depending on compiler may be used past front end
- Basic symbol table — columns designed to serve multiple purposes
- Table operations
    - Retain symbol: primarily for `static` variables, which persist across function calls
    - Create: at the start of what? Depends on design
- Hash table review — symbol tables are usually implemented with hash tables w/ open hashing
- Name arrays
- Old trick emphasised in the Dragon book. not really too important
- This is a historical trick - we don’t need it anymore
- This is just to prevent heap-allocated memory that might not necessarily be freed, so it’s allocated on the stack essentially
- type tables — reduce redundancy, so we don’t need to store the same type values (alignment, size, etc) multiple times
- scoping
    - single symbol table, with an undo stack (sketched after this list)
        - we keep pushing onto the stack
        - each time we reach a scope change, we push a marker `s`; when we exit the scope, we pop everything until the `s`
        - for a duplicate name within the new scope, we insert a new element; since it hashes to the same location, it sits before the old element, which is mainly why we use open hashing
    - multiple symbol tables, one per scope
        - we chain back to parent scopes' tables so that we can check which scope a symbol lives in
        - we can use a stack instead of explicit pointer chaining to manage the order
        - a little less expensive to do deletion, less likely to have hash collisions
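A compact C++ sketch of the single-table scheme (all names are mine, not the course's code): the front of each hash chain is the innermost declaration, and a marker stack undoes insertions on scope exit.

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

struct Symbol { std::string name, type; };

class SymbolTable {
    // Open hashing: each bucket is a chain; the newest insertion goes in
    // front, so lookups see the innermost (shadowing) declaration first.
    std::unordered_map<std::string, std::list<Symbol>> buckets;
    std::vector<std::string> undo;                  // insertion order
    static constexpr const char* MARK = "$scope$";  // stands in for the 's'

public:
    void enterScope() { undo.push_back(MARK); }

    void insert(const Symbol& sym) {
        buckets[sym.name].push_front(sym);          // shadows any outer binding
        undo.push_back(sym.name);
    }

    const Symbol* lookup(const std::string& name) const {
        auto it = buckets.find(name);
        if (it == buckets.end() || it->second.empty()) return nullptr;
        return &it->second.front();                 // innermost declaration
    }

    void exitScope() {                              // pop everything until 's'
        while (!undo.empty() && undo.back() != MARK) {
            buckets[undo.back()].pop_front();
            undo.pop_back();
        }
        if (!undo.empty()) undo.pop_back();         // drop the marker itself
    }
};
```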
- semantic analysis begin
- ensures the operations of the program all have a clear and well-defined meaning
- example checks
- for variable declaration: are variables previously declared in the same scope?
- for variable references: are they declared? what’s their type? is it a scalar?
- are expression types compatible?
- some languages do array bound checks, very hard to do
- java does this at runtime
- if there's an error, we propagate it up; this lets us output multiple errors in our failure message. g++ will also stop once it passes a certain error threshold
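A toy version of the declaration check (reusing the hypothetical `SymbolTable` sketched earlier), collecting errors instead of stopping at the first one:

```cpp
#include <string>
#include <vector>

void checkVarRef(const std::string& name, const SymbolTable& syms,
                 std::vector<std::string>& errors) {
    if (syms.lookup(name) == nullptr)
        errors.push_back("use of undeclared variable '" + name + "'");
    // Collecting errors lets us report many per run, much like g++ does
    // until it hits its error limit.
}
```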
21 Oct 2024
- typing
- mixing of types: `if(x)` is valid even if `x` is an `int`, even though `if` formally operates on booleans, etc.
- also inheritance with base/derived class pointers
- float + int: both are 32 bits, but if we interpret the float's bits as usual integer data we get garbage out, since floats are represented with an exponent + mantissa
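To see the garbage concretely (my example, not the lecture's): reinterpreting a float's bits as an integer yields its exponent/mantissa encoding, not its numeric value.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    float f = 1.0f;
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);    // copy the raw bit pattern
    // 1.0f encodes as 0x3f800000 (sign 0, exponent 127, mantissa 0),
    // so integer arithmetic on `bits` has nothing to do with 1.0.
    std::cout << std::hex << bits << "\n";  // prints 3f800000
}
```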
- type systems
- name equivalence: easy to infer
- structural equivalence: depends on our definition; could be an `int` followed by a `float`, and vice versa if we allow for it
    - unintended equivalences if structural equivalence is unconditional: difficult to use!
- compatibility: `int` and `float` are compatible, by definition
- inference
- smallC — reference sol implements more than necessary
- Intermediate representation
- now: we have an AST that has well-defined operations
- multiple equivalent definitions, which we can combine
- many languages → single IR → many machines
- can’t totally avoid quirks of specific language
- reuse the “most expensive” part of the compiler, since optimisers take the most work to build
- IR extensibility — so that we can support new devices, new features (like tensors)
- types of IRs
- low level IRs good for binary translators, instrumentors — like translating from one variant of x86 to another — close to machine
- multilevel IRs: we translate from one IR to another IR. might facilitate better optimisations
- MLIR: extends on LLVM IR, for ML models in a graph format
- for example, MLIR goes from graph, to lower level, to lower level IRs
- lowering: generally more common than raising levels
- generally: level can determine types of optimisations
- IR example
    - low level: `r2 = r1 * 40` assumes row-major order and skips to row `r1` from `fp`
        - floats are 4 bytes, 10 columns per row → 4 × 10 = 40 bytes per row
        - note: because of this we have to know machine-specific details (the frame pointer)
        - and language-specific ones (row-major order)
    - medium level: workable because we don't need the excessive amount of low-level detail to optimise on
- lecture IR
- virtual registers: assume a machine with an infinite number of registers
4 Nov 2024
- slide 13
- devoid of machine details — operations are simple load/stores/indexing, we use infinite registers
- offset by 4 bytes — implicit in this example but LLVM needs to be explicitly told
- slide 15
- LLVM optimiser — sequence of passes (analysis, transformation)
- clang: front end for the C family (C/C++, CUDA, Objective-C); LLVM also has front ends for many other languages, incl Java and Fortran
- slide 16
- bitcode file `.bc`: similar to an object file; a binary representation of the in-memory data structures. We can rebuild the in-memory structure from a bitcode file, including in between optimisation passes
- interpreter `lli` runs the bitcode file
- text file `.ll`: very useful for iterative development
    - an optimisation might take 6-12 months of work before we find out its value
    - but we can do the optimisation by hand in the text file, feed it back to LLVM, and get the executable out; benchmark and test performance; if promising, then commit to the 6-12 months
- slide 17
- clang is both a front end and driver — driving: getting a C file invokes lexical analysis, parsing, AST gen, IR gen, etc
- we can also pass in a `.bc`/`.ll` file (see the commands below)
- we define operations/instructions
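The round trip looks something like this (file names are placeholders; the tools are standard LLVM ones):

```sh
clang -emit-llvm -c prog.c -o prog.bc   # C source to bitcode
llvm-dis prog.bc -o prog.ll             # bitcode to editable .ll text
lli prog.bc                             # interpret the bitcode directly
clang prog.ll -o prog                   # hand an (edited) .ll back to clang
```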
- slide 19
- opaque ptr: we have to follow the pointer and query it to find what type it is
- slide 20
- `common`: primarily for linking purposes (externs and such); ignorable for now
- slide 24
- named registers: begin with %, then alpha name
- unnamed: begin with %, then numeric id
- used as temp registers
- registers have a scope
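A hand-written illustration (mine, not from the slides) of named vs unnamed registers and their scope:

```llvm
define i32 @inc_double(i32 %x) {   ; %x: named register, scoped to the function
entry:
  %sum = add i32 %x, 1             ; named virtual register
  %0 = mul i32 %sum, 2             ; unnamed: %, then a numeric id (a temp)
  ret i32 %0
}
```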