Course introduces the theory and design of compilers and interpreters. Laboratory work is done in ANTLR and LLVM with C++.

Concepts covered

Extra notes

Parsing lectures

  • When the dot is before a non-terminal, we add an item for every production where that non-terminal appears on the left-hand side (closure; a sketch is at the end of this list).
  • For each possible next symbol, we add a new state + transition edge.
    • And advance the dot.
  • We don’t add duplicate states; there’s no reason to. Just point multiple edges towards a given state.
  • $ == accept state (double rings)
  • Remember: a reduce item has neither a terminal nor a non-terminal after the dot (the dot is at the end).
    • if the dot is not at the end of a production, we get no reduction
  • why reduce under every token?
    • irrespective of what the input token is, we reduce
    • based solely on what’s on the stack
  • conflicts
    • two items with the dot at the end of different productions in the same state == a reduce-reduce conflict
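
A minimal sketch of the closure step described above, using a toy grammar representation (all types and names here are illustrative, not course code):

    #include <set>
    #include <string>
    #include <tuple>
    #include <vector>

    struct Production {
        std::string lhs;               // non-terminal on the left
        std::vector<std::string> rhs;  // symbols on the right
    };

    struct Item {                      // LR(0) item: production index + dot position
        int prod;
        int dot;
        bool operator<(const Item& o) const {
            return std::tie(prod, dot) < std::tie(o.prod, o.dot);
        }
    };

    // While the dot sits before a non-terminal, add an item (dot at position 0)
    // for every production of that non-terminal. Keeping items in a set means we
    // never add duplicates.
    std::set<Item> closure(std::set<Item> items,
                           const std::vector<Production>& grammar,
                           const std::set<std::string>& nonterminals) {
        bool changed = true;
        while (changed) {
            changed = false;
            for (Item it : std::set<Item>(items)) {        // iterate over a snapshot
                const auto& rhs = grammar[it.prod].rhs;
                if (it.dot >= (int)rhs.size()) continue;   // dot at the end: reduce item
                const std::string& next = rhs[it.dot];
                if (!nonterminals.count(next)) continue;   // terminal after the dot
                for (int p = 0; p < (int)grammar.size(); ++p)
                    if (grammar[p].lhs == next)
                        changed = items.insert({p, 0}).second || changed;
            }
        }
        return items;
    }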

7 Oct 2024

  • slide 139
    • two choices: we could’ve either done E-E or E-
    • but because we reduce first we lose the ability to reduce E - E in LR(0)
  • slide 141
    • conflicts in state 2
  • slide 143
    • LR(1): largely extends LR(0)
    • conflicts are avoidable with a lookahead
  • slide 144
    • LR(1): more states compared to LR(0), but the lookahead lets us resolve what would otherwise be conflicts
  • slide 145
    • the colon in EBNF/ANTLR rules == the right arrow (→) in grammar productions
    • NEWLINE: [\r\n]+;
  • slide 146
    • ambiguities are resolved in favour of the earlier rule, working down the file (rule order matters)
  • AST slides
    • Abstract syntax tree: simplifies the parse tree.
    • Removes elements that are necessary within the grammar but not needed afterwards (e.g. punctuation); we don’t change the program’s meaning/operations.
    • We also annotate the AST nodes with lots of metadata.
  • slide 7
    • left/right child pointers
    • we use a threaded tree to add pointers back to the parent
  • slide 8
    • alternatively, we could maintain a vector of child pointers
    • we keep threading to parents (a minimal node sketch is at the end of this section)
    • note: compiler references are usually quite old and optimise for memory; these days memory is plentiful, so we don’t have to optimise as aggressively
  • slide 11
    • basically embedding C++ code within an ANTLR4 file
    • each given line of C++ wrapped in curly brackets is an action
    • token attributes: useful values we can embed within C++
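
A minimal sketch of an AST node along the lines of slides 7-8: a vector of child pointers plus a thread back to the parent (names are illustrative, not the course’s ASTNode classes):

    #include <memory>
    #include <string>
    #include <vector>

    struct ASTNode {
        std::string op;                                  // e.g. "+", "ID", "IF"
        ASTNode* parent = nullptr;                       // threading back to the parent
        std::vector<std::unique_ptr<ASTNode>> children;  // arbitrary number of children

        ASTNode* addChild(std::unique_ptr<ASTNode> child) {
            child->parent = this;                        // maintain the parent thread
            children.push_back(std::move(child));
            return children.back().get();
        }
    };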

10 Oct 2024

  • exam logistics
    • past midterms: don’t expect proof questions in scope of current midterm
  • lab logistics
    • marks to come out soon, can check all test cases after submission deadline with exercise command
  • two ways to build AST: incrementally during parse tree construction, or after parse tree is built
  • ANTLR allows us to build it during parse tree construction
    • @header prepends code snippets to beginning of file
    • Code in curly brackets associates a snippet with a particular identifier — note this still must be valid {C++, Java, whatever language} syntax
    • $ — token attribute
    • @init — about to start looking for an expr, we execute the snippet
    • Non-terminals can have properties, like return values
      • These properties are only associated with a non-terminal instance — if we call $expr.op for two different exprs, we get different values
    • We can also have local variables, which must be defined between the returns and the @init
  • In our C++ source file, ANTLR will create a method with the same name as the start symbol of our program that we can invoke
    • when we invoke, we start the parsing process from that symbol
  • We build the AST in ANTLR with these actions
    • At @init, we construct a new node, at the end, we return it
    • We also define a local inode to make things easier for us
  • An alternate way is to use a “visitor” class: after the parse tree is built, we traverse it and construct the AST (a self-contained sketch of the idea is at the end of this section)
  • ANTLR can automatically generate a visitor class
    • antlrcpp::Any is like void*, a placeholder that can refer to any type
    • Probably need to go through ANTLR documentation
  • Pros and cons?
    • Actions: clutter the grammar file, don’t get access to C++ tooling, and are tedious to convert from one parser generator to another
    • Now, people like to keep things clean in the grammar file and keep things separate (using separate visitors).
  • SmallC
    • ExprNode class will never show up in the tree, it functions only to be inherited from
    • ASTNodes — implementation will be given, either as .cpp or .o
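
A self-contained sketch of the “build the AST after the parse tree exists” idea; the toy ParseNode/ASTNode types stand in for ANTLR’s generated contexts and are not the course’s classes:

    #include <memory>
    #include <string>
    #include <vector>

    struct ParseNode {                    // stand-in for an ANTLR parse-tree node
        std::string symbol;               // grammar symbol, e.g. "expr", "(", "ID"
        std::vector<std::unique_ptr<ParseNode>> children;
    };

    struct ASTNode {
        std::string op;
        std::vector<std::unique_ptr<ASTNode>> children;
    };

    // Walk the parse tree and keep only what matters semantically; grammar-only
    // symbols (here, just parentheses) are dropped without changing meaning.
    std::unique_ptr<ASTNode> buildAST(const ParseNode& p) {
        auto node = std::make_unique<ASTNode>();
        node->op = p.symbol;
        for (const auto& child : p.children) {
            if (child->symbol == "(" || child->symbol == ")") continue;
            node->children.push_back(buildAST(*child));
        }
        return node;
    }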

17 Oct 2024

  • Why a symbol table?
    • Subsequent phases need the information — it may be used for IR generation and, depending on the compiler, past the front end
  • Basic symbol table — columns designed to serve multiple purposes
  • Table operations
    • Retain symbol — primarily for static variables, which persist across function calls
    • Create — at the start of what? Depends on design
  • Hash table review — symbol tables are usually implemented with hash tables w/ open hashing
  • Name arrays
    • Old trick emphasised in the Dragon book; it’s a historical technique and not really needed anymore
    • The point is to avoid lots of heap-allocated strings that might never be freed: the names essentially live together in one pre-allocated array instead
  • type tables — reduce redundancy, so we don’t need to store the same type values (alignment, size, etc) multiple times
  • scoping
    • single symbol table, with undo stack
      • we keep pushing onto the stack
      • each time we enter a new scope, we push a scope marker; when we exit the scope, we pop everything until the marker
      • for a duplicate name in a new scope, we insert a new element; since it hashes to the same bucket, it sits in front of the old element in the chain. this is largely why we use open hashing
    • multiple symbol tables, one per scope
      • we chain back to parent scopes’ tables so that we can check which scope a symbol lives in
      • and we can use a stack instead of explicit pointer chaining to manage the order
      • deletion is a little less expensive (drop the scope’s whole table), and hash collisions are less likely (a sketch using a stack of per-scope tables is at the end of this section)
  • semantic analysis begins
  • ensures the operations of the program all have a clear and well-defined meaning
  • example checks
    • for variable declaration: are variables previously declared in the same scope?
    • for variable references: are they declared? what’s their type? is it a scalar?
    • are expression types compatible?
    • some languages do array bounds checks, which are very hard to do statically
      • Java does this at runtime
    • if there’s an error, we propagate it up; this lets us output multiple errors in the failure message. g++ will also stop once it passes a certain error threshold
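
A minimal sketch of the “one symbol table per scope” scheme: a stack of scopes searched innermost-first (std::unordered_map stands in for the hand-rolled open-hashing table discussed in lecture; names are illustrative):

    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct SymbolInfo { std::string type; /* size, alignment, ... */ };

    class ScopedSymbolTable {
        std::vector<std::unordered_map<std::string, SymbolInfo>> scopes_;
    public:
        void enterScope() { scopes_.emplace_back(); }
        void exitScope()  { scopes_.pop_back(); }

        // Fails (returns false) if the name is already declared in the current scope.
        bool declare(const std::string& name, SymbolInfo info) {
            return scopes_.back().emplace(name, std::move(info)).second;
        }

        // Search from the innermost scope outwards, mirroring the chain back to
        // parent scopes described above.
        std::optional<SymbolInfo> lookup(const std::string& name) const {
            for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
                auto hit = it->find(name);
                if (hit != it->end()) return hit->second;
            }
            return std::nullopt;   // undeclared
        }
    };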

21 Oct 2024

  • typing
    • mixing of types: if(x) is valid even if x is an int, even though if formally operates on booleans, etc
    • also inheritance with base/derived class ptrs
    • float + int: both 32-bit, but if we just combine the bit patterns as usual we get garbage out, since floats are represented with an exponent + mantissa; a conversion is needed first
  • type systems
    • name equivalence: easy to infer
    • structural equivalence: depends on our definition. e.g. an int followed by a float; whether the reverse order also counts is up to us.
      • unintended equivalences if structural equivalence is unconditional — difficult to use!
    • compatibility:
      • int and float are compatible, by definition
    • inference
  • SmallC — the reference solution implements more than necessary
  • Intermediate representation
  • now: we have an AST that has well-defined operations
  • multiple equivalent definitions, which we can combine
  • many languages, one IR, many machines
    • can’t totally avoid quirks of specific languages
  • reuse the “most expensive” part of the compiler — optimisers take the most work to build
  • IR extensibility — so that we can support new devices, new features (like tensors)
  • types of IRs
    • low level IRs good for binary translators, instrumentors — like translating from one variant of x86 to another — close to machine
    • multilevel IRs: we translate from one IR to another IR. might facilitate better optimisations
      • MLIR: extends on LLVM IR, for ML models in a graph format
      • for example, MLIR goes from graph, to lower level, to lower level IRs
      • lowering: generally more common than raising levels
    • generally: level can determine types of optimisations
  • IR example
    • low level:
      • r2 = r1 * 40 assumes row-major order and skips to row r1 from the frame pointer
        • floats are 4 bytes, 10 columns == 40 bytes per row (the arithmetic is spelled out in the sketch at the end of this section)
      • note: because of this, we have to know machine-specific details (the frame pointer)
      • also language-specific details: row-major order
    • medium level — workable because we don’t need the excessive amount of low-level detail to optimise on
  • lecture IR
    • virtual registers: assume we have a machine with an infinite number of registers
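
The address arithmetic behind the low-level IR example above, written out for a row-major float array with 10 columns (the function and parameter names are illustrative):

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t NUM_COLS = 10;

    // &a[row][col] for a row-major float a[R][10]: the row index is scaled by
    // 10 columns * 4 bytes = 40, which is exactly the r2 = r1 * 40 in the IR.
    std::uintptr_t elementAddress(std::uintptr_t base, std::size_t row, std::size_t col) {
        std::size_t rowOffset = row * NUM_COLS * sizeof(float);  // row * 40
        std::size_t colOffset = col * sizeof(float);             // col * 4
        return base + rowOffset + colOffset;
    }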

4 Nov 2024

  • slide 13
    • devoid of machine details — operations are simple load/stores/indexing, we use infinite registers
    • offset by 4 bytes — implicit in this example but LLVM needs to be explicitly told
  • slide 15
    • LLVM optimiser — sequence of passes (analysis, transformation)
    • clang — front end for C/C++ and the other C-family languages (Objective-C, CUDA, OpenCL); LLVM also has front ends for further languages such as Fortran
  • slide 16
    • bitcode file .bc — similar to an object file: a binary serialisation of the in-memory IR data structure. we can rebuild the in-memory structure from a bitcode file, including in between optimisation passes
    • interpreter lli for the bitcode file
    • text file .ll very useful for iterative development
      • optimisation might take 6-12 months to find value
      • but we can do the optimisation by hand in the text file, feed it back to LLVM, and get the executable out. benchmark and test performance. if promising, then commit to the 6-12 months
  • slide 17
    • clang is both a front end and a driver — driving means: given a C file, it invokes lexical analysis, parsing, AST gen, IR gen, etc.
    • we can also pass in a bc/ll file
    • we define operations/instructions
  • slide 19
    • opaque ptr: the pointer itself no longer encodes a pointee type; we have to query how it is used to find what type it points to
  • slide 20
    • common — primarily for linking purposes (externs and such). ignorable for now
  • slide 24
    • named registers: begin with %, then alpha name
    • unnamed: begin with %, then numeric id
      • used as temp registers
    • registers have a scope

7 Nov 2024

  • IR
    • common — related to how it’s linked. this is allocated within this file
      • variables could be extern, so it might be allocated elsewhere
    • alloca: a local variable on the stack — it can be read from and written to
    • @ global variable
      • i.e., @main, @malloc
    • %: registers or local variables (if defined with alloca) — SSA registers take a value once and cannot be changed
    • #0 attributes, decorators post optimisation. “breadcrumbs to show what you’ve done”
    • local_unnamed_addr — the address of this value is not significant (only its contents matter); these are generated automatically
    • nuw no unsigned wrap — if overflow/underflow, result is invalid
    • barring gotos, labels for control flow are generated automatically
    • must always ret from a function
    • load/stores
      • now ptr instead of <type>* for new LLVM
    • mostly similar to Assembly programming - just with syntactical changes
    • aggregate types: array, struct, or class
      • new notation: getelementptr ptr %a, i64 1
    • phi instruction: consequence of SSA enforcement
      • it just selects a value based on which predecessor block we arrived from; it doesn’t conditionally evaluate anything
      • fictitious instruction that does this - not really a real instruction
    • misc instructions used to do explicit type conversions
    • SetInsertPoint() sets where the IRBuilder emits subsequent instructions: at the end of a block or before an existing instruction (see the IRBuilder sketch at the end of this section)
  • optimisations
    • algebraic simplifications — where we already know the result
    • CSE — common subexpression elimination
      • removes repeatedly computed values
      • need to prove that repeated computation is guaranteed to be the same if we want to remove, i.e., it doesn’t change between computations
    • copy propagation — no need to use the copy, just use the original
    • DCE — dead code elimination
      • remove things that are computed but are never used
    • IVR — induction variable recognition, IVE — induction variable elimination
      • we don’t need the loop iterators (i.e., what controls the loop). we can work directly with the addr
  • control flow analysis
    • intra-procedural here; inter-procedural analysis is not the focus of this course
    • basic blocks
      • goto S5 — means we have to split BB1 and BB2, because we could start in the middle of a BB
      • at conditional branches, we must branch at the end, not anywhere in the middle
    • control flow graphs
      • multiple exit nodes == multiple returns
      • multiple entry nodes? not in C/C++/relevant langs. just a historical curiosity (Fortran)
      • branch nodes — denotes a conditional branch in the code
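
A minimal sketch of emitting this kind of IR through the LLVM C++ API, tying together alloca, load/store, ret and SetInsertPoint (function and variable names are illustrative; error handling and the full lab setup are omitted):

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module mod("demo", ctx);
        llvm::IRBuilder<> builder(ctx);

        // Roughly: int f() { int x = 41; return x + 1; }
        auto *fnType = llvm::FunctionType::get(builder.getInt32Ty(), false);
        auto *fn = llvm::Function::Create(fnType, llvm::Function::ExternalLinkage,
                                          "f", &mod);
        auto *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
        builder.SetInsertPoint(entry);            // subsequent instructions go here

        auto *x = builder.CreateAlloca(builder.getInt32Ty(), nullptr, "x");  // stack slot
        builder.CreateStore(builder.getInt32(41), x);
        auto *val = builder.CreateLoad(builder.getInt32Ty(), x, "x.val");
        auto *sum = builder.CreateAdd(val, builder.getInt32(1), "sum");      // named %sum
        builder.CreateRet(sum);                   // every function must ret

        mod.print(llvm::outs(), nullptr);         // dump the textual .ll form
        return 0;
    }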

11 Nov 2024

  • control flow graphs
    • entry dominates every other node
    • a node dominates itself
    • proper dominance — basically dominance with the equality removed (a properly dominates b if a dominates b and a ≠ b)
    • dominator — d dominates a BB if every path from the entry to that BB contains d
      • if some path to the BB avoids a node, that node isn’t included in its set of dominators
    • set of dominators of a node is unique
      • suppose by contradiction that for a node we have two different dominator sets D1 and D2, such that D1 has some node that doesn’t exist in D2, and vice versa
      • then that implies a contradiction: any node in D1 dominates the node by definition, so it must also appear in D2 (and vice versa)
    • DOM_d is a singleton set, i.e., it can only have one element in the set
  • algorithm for finding dominators
    • assertion: if d (≠ n) is a dominator of n, then it is a dominator of all predecessors of n
    • faster if we process nodes in DFS order: we can complete in a single iteration and be done (the iterative algorithm is sketched at the end of this section)
  • the set of post-dominators in the forward graph is the set of dominators in the reverse graph
  • loop detection
    • for a natural loop, the entry node dominates all other nodes in the loop; an extra entry point into the middle of the loop breaks this
    • the tail is a descendant of the head, and an edge back to an ancestor is a back edge
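
A sketch of the iterative dominator computation referenced above: start from “everything dominates everything”, then repeatedly intersect over predecessors and add the node itself until nothing changes (toy CFG representation, illustrative only):

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <vector>

    // preds[b] lists the predecessor blocks of block b; blocks are 0..n-1.
    std::vector<std::set<int>> dominators(int n, int entry,
                                          const std::vector<std::vector<int>>& preds) {
        std::set<int> all;
        for (int i = 0; i < n; ++i) all.insert(i);

        std::vector<std::set<int>> dom(n, all);   // optimistic start: everything dominates
        dom[entry] = {entry};                     // the entry is dominated only by itself

        bool changed = true;
        while (changed) {
            changed = false;
            for (int b = 0; b < n; ++b) {
                if (b == entry) continue;
                std::set<int> newDom = all;
                for (int p : preds[b]) {          // intersect the predecessors' sets
                    std::set<int> tmp;
                    std::set_intersection(newDom.begin(), newDom.end(),
                                          dom[p].begin(), dom[p].end(),
                                          std::inserter(tmp, tmp.begin()));
                    newDom = std::move(tmp);
                }
                newDom.insert(b);                 // a node always dominates itself
                if (newDom != dom[b]) { dom[b] = std::move(newDom); changed = true; }
            }
        }
        return dom;
    }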

14 Nov 2024

  • natural loops — single entry, unique tail
  • finding back edges — statements presented without proof
    • for possibly reducible graphs — we must check if the head dominates the tail
  • algorithm for finding loops
    • pop node, insert predecessors onto stack
  • slide 98
    • core idea: given d = 8, can we show that d won’t be changed before the next time it’s used (read)? if there’s no change, we can optimise e = d + 2 into e = 8 + 2 and then e = 10
    • if there’s any change in one of the branches, we conclude we cannot optimise
  • local reaching definitions — within a BB. at the beginning, we’ve made no definitions
    • RDin(s) — set of all RDs from the previous blocks that survive to the beginning of this BB
    • Gen(s) — generate set, set of definitions that we create
    • Kill(s) — kill set, set of definitions killed
  • suppose we have a back edge during RD data flow
    • we need RDout(1) and RDout(5) to calculate RDin(2). we don’t know RDout(5) yet. we assume for now that RDout(5) is ∅; then, when we come to it, we update and repeat the analysis until no changes happen
    • fixed point iteration algorithm — useful for handling cyclic dependencies
  • reaching definitions solver (a solver sketch in code is at the end of this section)
    • initialise RDout(b) with ∅ for all b
    • keep iterating until we’ve reached a “fixed point”, when no more changes are possible
    • initialising with Gen(b) is better than ∅ since it means we start one iteration ahead
    • termination?
      • since the set of definitions is always finite, we’ll terminate
      • we only make set unions. at worst, we add until we add everything, but since everything is finite we will terminate
      • monotonic algorithm
    • correctness? assume correctness!
    • how many iterations? no more than the number of back edges + 1
    • forward any-path problem — forward because we’re asking a question about what happened earlier in the CFG; any-path because the property only has to hold along some path, hence the set union
  • live variable analysis
    • used for register allocation — very easy to understand why — if a variable is dead, we don’t need to keep it in a register and can throw it out
    • ground truth — at the end of the BB, nothing is live by definition
    • in b = a + b, b is live and then immediately killed, since the read is checked before the write
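
A sketch of the reaching-definitions fixed-point solver described above: RDout(b) starts as Gen(b), RDin(b) is the union of RDout over the predecessors, and RDout(b) = Gen(b) ∪ (RDin(b) − Kill(b)) until nothing changes (toy block/definition ids, illustrative only):

    #include <cstddef>
    #include <set>
    #include <vector>

    struct Block {
        std::vector<int> preds;   // predecessor block ids
        std::set<int> gen, kill;  // definition ids generated / killed in this block
    };

    std::vector<std::set<int>> reachingDefs(const std::vector<Block>& cfg) {
        std::vector<std::set<int>> rdOut(cfg.size());
        for (std::size_t b = 0; b < cfg.size(); ++b)
            rdOut[b] = cfg[b].gen;                  // start one iteration ahead

        bool changed = true;
        while (changed) {                           // terminates: finite defs, sets only grow
            changed = false;
            for (std::size_t b = 0; b < cfg.size(); ++b) {
                std::set<int> rdIn;
                for (int p : cfg[b].preds)          // any-path: union over predecessors
                    rdIn.insert(rdOut[p].begin(), rdOut[p].end());

                std::set<int> out = cfg[b].gen;     // Gen(b) ∪ (RDin(b) − Kill(b))
                for (int d : rdIn)
                    if (!cfg[b].kill.count(d)) out.insert(d);

                if (out != rdOut[b]) { rdOut[b] = std::move(out); changed = true; }
            }
        }
        return rdOut;
    }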

18 Nov 2024

  • live variable analysis
    • Use, Def analogous to Gen, Kill sets for reaching definitions
    • data flow eqns — remember we work from bottom up
    • why initialise with use(b)? saves one iteration — recall LVin takes the set union with use(b); if initialised with ∅, we would have to union with use(b) on the first iteration anyway
    • in what sense is this conservative?
      • it looks at all possible paths of execution, irrespective of whether paths are actually taken during runtime
      • how is this conservative? a variable only has to be live along any one possible path for us to treat it as live
      • even if a path is never taken at runtime, the static analysis still checks every path
        • so a use counts as live if it is live on any path
  • available exprs
    • x + b not available, since x written to in one path; d + f available, since not overwritten in any path
    • exprs: d * d is available, since the expression survives along both paths
    • exprs: we don’t need the result in a common variable, just the expression to survive, even though it’s stored into different variables on each path
      • how to optimise this one (to reduce the “ambiguity”)? introduce a temporary: t = d * d, f = t in the top BB
      • t = d * d, e = t in the left BB
      • then reuse t (whatever = t) at the end (a before/after sketch is at the end of this section)
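
A before/after sketch of that rewrite in source form (variable names follow the example loosely; this is an illustration, not the slide’s exact code):

    // before: d * d is computed on both paths and again at the join
    void before(bool cond, double d, double& e, double& f, double& g) {
        if (cond) { e = d * d; }      // left block
        else      { f = d * d; }      // top/right block
        g = d * d;                    // recomputed even though the expression is available
    }

    // after: each path stores the value in t, the join just reuses t
    void after(bool cond, double d, double& e, double& f, double& g) {
        double t;
        if (cond) { t = d * d; e = t; }
        else      { t = d * d; f = t; }
        g = t;                        // the expression survives from both paths
    }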

25 Nov 2024

  • available exprs
    • will it terminate? yes, because we keep eating away at the sets (they are finite and only shrink)
    • is it correct? assume
    • how fast does it terminate? same as usual
    • conservative? yes, it considers all paths by taking the intersection
    • asking a question about the past — a forward problem
  • very busy expr — same as usual
  • forward — the set at the entry is ∅; backward — asks about what happens forward in time, the set at the end is ∅
  • local optimisations
    • constant folding: always allowable?
      • integers - depends on over/underflow. if it would overflow, print a warning but don’t terminate the compiler
      • floats — what if the hardware’s float behaviour differs from the machine we compile on? rounding errors — so compilers don’t fold floats at all because it’s too risky
    • algebraic simplifications
      • i * 7 vs (i << 3) - i — what if the bitshift results in overflow? not always safe (see the sketch at the end of this section)
    • local CSE
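
A small sketch of the strength-reduction identity above (hypothetical function name, just to make the caveat concrete):

    #include <cstdint>

    // For unsigned values the identity always holds: wrap-around arithmetic keeps
    // (8*i - i) mod 2^32 equal to 7*i mod 2^32. For signed int, i << 3 can overflow
    // (undefined behaviour) even when i * 7 still fits, which is why the rewrite is
    // "not always safe".
    std::uint32_t timesSevenByShift(std::uint32_t i) {
        return (i << 3) - i;   // 8*i - i == 7*i
    }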

28 Nov 2024

  • global CSE
    • determine evaluations of x op y: with reaching evaluations analysis (very similar to reaching definitions)
  • copy propagation
    • reaching definitions
    • s must dominate u
    • kinda messy - we do another type of data flow analysis called available copy expressions
  • unreachable code
    • so it doesn’t fill the instruction cache — even though it never executes, unreachable code still reduces speed somewhat
  • dead code elimination
    • usual cycle is: CSE, copy prop, DCE
    • start from essential instructions for output, then move up to definitions, then to control dependencies
  • loop invariant code motion
    • sometimes loop invariant code is entirely unintentional — a 2D array reference, for example: we don’t need to recompute the row offset for each inner loop iteration (sketched at the end of this section)
    • why exactly one definition? see the example: the value is computed inside the loop differently depending on the branch, so we can’t hoist it outside the loop and pre-compute it
    • example
      • step 1: label constants, reaching defns from outside of the loop
      • step 2: any loop invariant instructions?
      • repeat
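
A sketch of the unintentional loop-invariant code in a 2D array walk and what hoisting it looks like (names and sizes are illustrative):

    constexpr int R = 100, C = 100;

    // Naive form: the address of a[i] (the row offset i * C) is implicitly
    // recomputed on every inner iteration even though it only depends on i.
    void scaleNaive(float (&a)[R][C], float k) {
        for (int i = 0; i < R; ++i)
            for (int j = 0; j < C; ++j)
                a[i][j] *= k;
    }

    // After loop-invariant code motion: the row pointer is computed once per
    // outer iteration and the inner loop only indexes within the row.
    void scaleHoisted(float (&a)[R][C], float k) {
        for (int i = 0; i < R; ++i) {
            float* row = a[i];          // hoisted: does not depend on j
            for (int j = 0; j < C; ++j)
                row[j] *= k;
        }
    }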

2 Dec 2024

  • code motion example pp 229
    • changed from f = 1 + a at branch to f = d + a
    • we can’t move the definition of d because it doesn’t dominate the loop exit, and it’s not the only definition
  • loop invariant code motion — the final exam’s scope ends here