Course introduces the design of compilers and interpreters. Laboratory work is done in ANTLR with C++.
Concepts covered
- Lexical analysis
- Formal language
- Operators (concatenation, alternation, Kleene closure)
- Regular expression
- Finite state automata
- Deterministic finite automata (DFA), non-deterministic finite automata (NFA)
- Regex to NFA conversions
- NFA to DFA conversions (epsilon-closures)
- Parsing
- Grammar
- Context-free grammar
- Extended Backus-Naur form
- Derivations, grammar ambiguity
- Parse tree
- Parsing table
- Parsing algorithms
- Top-down parsing
- LL(1)
- First, follow sets
- Bottom-up parsing
- Shift-reduce operations
- LR(0), LR(1)
- Parser states
- State closures
- goto function
- Abstract syntax tree
- Construction via actions, visitors
- Symbol table
- Hash table, linked list
- Name arrays
- Type tables
- Semantic analysis
- Data type (type systems)
- Type equivalence
- Type compatibility
- Type inference
- Intermediate representation (IR)
- LLVM intermediate representation
- Compiler optimisation
Extra notes
Parsing lectures
- When the dot is at a non-terminal, we need to add transitions to all items for that non-terminal's productions (the closure).
- For each possible next symbol, we add a new state plus a transition edge, and move the dot forward.
- We don’t add duplicate states - no reason for us to do that. Just point multiple edges towards a given state
- $ == accept state (double rings)
- Remember $ is neither a terminal nor a non-terminal.
- if the dot is not at the end of a production, we get no reduction
- why reduce under every token?
- irrespective of what the input token is, we reduce
- based solely on what’s on the stack
- conflicts
- two items with the dot at the end of different productions in the same state == a reduce-reduce conflict
7 Oct 2024
- slide 139
- two choices: we could have reduced either E - E or just E
- but because we reduce first, we lose the ability to reduce E - E in LR(0)
- slide 141
- conflicts in state 2
- slide 143
- LR(1): largely extends LR(0)
- conflicts avoidable with a lookahead
- slide 144
- LR(1): more states compared to LR(0), but allows us to distinguish between conflicts
- slide 145
- EBNF: colons == right arrow, e.g. `NEWLINE: [\r\n]+;`
- slide 146
- conflicts are resolved in favour of the first rule, moving down the file
- AST slides
- Abstract syntax tree: simplifies the parse tree.
- Removes elements that are necessary within the grammar but not otherwise needed in practice. We don't change the program's meaning/operations.
- We also embed the AST with lots of metadata.
- slide 7
- left and right child pointers
- we use a Threaded tree to add pointers to the parent
- slide 8
- alternatively, we could maintain a vector of child pointers
- we keep threading to parents
- note: compiler references are usually quite old and optimise for memory. nowadays we don't really care, so we don't have to optimise too much
- slide 11
- basically embedding C++ code within an ANTLR4 file
- each given line of C++ wrapped in curly brackets is an action
- token attributes: useful values we can embed within C++
10 Oct 2024
- exam logistics
- past midterms: don’t expect proof questions in scope of current midterm
- lab logistics
- marks to come out soon, can check all test cases after submission deadline with exercise command
- two ways to build AST: incrementally during parse tree construction, or after parse tree is built
- ANTLR allows us to build it during parse tree construction
- `@header` prepends code snippets to the beginning of the file
- Code in curly brackets associates a snippet with a particular identifier — note this still must be valid {C++, Java, whatever language} syntax
- `$` — token attribute
- `@init` — when about to start looking for an `expr`, we execute the snippet
- Non-terminals can have properties, like return values
- These properties are only associated with a non-terminal instance — if we call `$expr.op` for two different `expr`s, we get different values
- We can also have local variables, which must be defined between the `returns` and the `@init`
- In our C++ source file, ANTLR will create a method with the same name as the start symbol of our program that we can invoke
- when we invoke, we start the parsing process from that symbol
- We build the AST in ANTLR with these actions
- At `@init`, we construct a new node; at the end, we return it
- We also define a local `inode` to make things easier for us
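A minimal sketch of how these pieces fit together (my reconstruction, not from the slides; `ASTNode`, `AddNode`, and `IntNode` are hypothetical lab-style C++ classes):

```antlr
expr returns [ASTNode *node]       // non-terminal property (return value)
     locals  [ASTNode *inode]      // local variable, defined between returns and @init
     @init   { $inode = nullptr; } // runs when we start looking for an expr
    : l=expr '+' r=expr { $node = new AddNode($l.node, $r.node); }     // action
    | INT               { $node = new IntNode(std::stoi($INT.text)); } // $INT.text: token attribute
    ;
```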
- An alternate way is to use a "visitor" class to build the AST: we traverse the parse tree after it's built and construct the AST during that traversal
- ANTLR can automatically generate a visitor class
- `antlrcpp::Any` is like `void*`, a placeholder that can refer to any type
- Probably need to go through the ANTLR documentation
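A rough sketch of the visitor route (my code, assuming a hypothetical grammar named `Expr` with a labeled alternative `Add`; the class names follow ANTLR's generated-code scheme but aren't from the slides, and `ASTNode`/`AddNode` are hypothetical lab classes):

```cpp
#include "ExprBaseVisitor.h"  // generated by ANTLR from the hypothetical Expr grammar

// Builds the AST by walking the finished parse tree.
class ASTBuilder : public ExprBaseVisitor {
public:
    antlrcpp::Any visitAdd(ExprParser::AddContext *ctx) override {
        // Visit both children, unwrapping the antlrcpp::Any results.
        ASTNode *lhs = visit(ctx->expr(0)).as<ASTNode *>();
        ASTNode *rhs = visit(ctx->expr(1)).as<ASTNode *>();
        return static_cast<ASTNode *>(new AddNode(lhs, rhs));  // wrapped back into Any
    }
};
```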
- Pros and cons?
- Actions: clutter the grammar file, we don't get access to C++ tooling, and it's difficult (tedious) to convert actions from one parser generator to another
- Now, people like to keep things clean in the grammar file and keep things separate (using separate visitors).
- SmallC
- the `ExprNode` class will never show up in the tree; it functions only to be inherited from
- `ASTNodes` — implementation will be given, either as `.cpp` or `.o`
17 Oct 2024
- Why a symbol table?
- Subsequent phases need information — may be used for IR generation, depending on compiler may be used past front end
- Basic symbol table — columns designed to serve multiple purposes
- Table operations
- Retain symbol — primarily for `static` variables, which persist across function calls
- Create — at the start of what? Depends on design
- Hash table review — symbol tables are usually implemented with hash tables w/ open hashing
- Name arrays
- Old trick emphasised in the Dragon book — a historical trick, not really important anymore
- It just prevents heap-allocated memory that might not necessarily be freed; names are essentially allocated on a stack instead
- type tables — reduce redundancy, so we don’t need to store the same type values (alignment, size, etc) multiple times
- scoping
- single symbol table, with undo stack
- we keep pushing onto the stack
- each time we reach a scope change, we push a marker `s`. when we exit this scope, we pop everything until the `s` (see the sketch after this list)
- for duplicate element name within new scope, we insert a new element, and since it hashes to the same location, we insert before the old element. this is mainly why we use open hashing
- multiple symbol tables, one per scope
- we chain back to parent scopes’ tables so that we can check which scope a symbol lives in
- and we can use a stack instead of explicit pointer chaining to manage the order
- a little less expensive to do deletion, less likely to have hash collisions
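A minimal sketch of the single-table + undo-stack idea (my code, not the reference implementation; the open-hashing insert-at-head trick is emulated here with a per-name stack so the innermost declaration shadows outer ones):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Single symbol table with an undo stack. An empty string on the undo stack
// stands in for the lecture's `s` scope marker (identifiers are never empty).
class SymbolTable {
    // name -> stack of attributes; the innermost declaration is at the back.
    std::unordered_map<std::string, std::vector<int>> table_;
    std::vector<std::string> undo_;  // names in insertion order, plus markers

public:
    void enterScope() { undo_.push_back(""); }  // push the scope marker

    void insert(const std::string &name, int attr) {
        table_[name].push_back(attr);  // shadows any outer declaration
        undo_.push_back(name);
    }

    const int *lookup(const std::string &name) const {
        auto it = table_.find(name);
        if (it == table_.end() || it->second.empty()) return nullptr;
        return &it->second.back();     // innermost declaration wins
    }

    void exitScope() {
        // Pop everything inserted since the last marker.
        while (!undo_.empty() && !undo_.back().empty()) {
            table_[undo_.back()].pop_back();
            undo_.pop_back();
        }
        if (!undo_.empty()) undo_.pop_back();  // pop the marker itself
    }
};
```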
- semantic analysis begin
- ensures the operations of the program all have a clear and well-defined meaning
- example checks
- for variable declaration: are variables previously declared in the same scope?
- for variable references: are they declared? what’s their type? is it a scalar?
- are expression types compatible?
- some languages do array bound checks, very hard to do
- java does this at runtime
- if error, we propagate it up. this allows us to output multiple errors in our failure message. G++ will also stop if it passes a certain error threshold
21 Oct 2024
- typing
- mixing of types: `if(x)` is valid even if `x` is an `int`, even though `if` formally operates on booleans, etc
- also inheritance with base/derived class ptrs
- floats + int: 32-bit + 32-bit. if we interpret as usual, we get garbage out, since floats are represented with exponent + mantissa
- type systems
- name equivalence: easy to infer
- structural equivalence: depends on our definition. could be `int` followed by `float`, or vice versa if we allow for it
- unintended equivalences if structural equivalence is unconditional — difficult to use! (see the example below)
- compatibility: `int` and `float` are compatible, by definition
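A quick C++ illustration of name vs. structural equivalence (my example, not from the slides):

```cpp
// Two structurally identical types with different names.
struct Celsius    { float degrees; };
struct Fahrenheit { float degrees; };

void setThermostat(Celsius c);

// Under name equivalence, Celsius and Fahrenheit are distinct types, so the
// call below is a compile-time error. Under unconditional structural
// equivalence they would be interchangeable: an unintended equivalence.
// setThermostat(Fahrenheit{98.6f});   // rejected under name equivalence
```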
- inference
- smallC — reference sol implements more than necessary
- Intermediate representation
- now: we have an AST that has well-defined operations
- multiple equivalent definitions, which we can combine
- many languages → single IR → many machines
- can’t totally avoid quirks of specific language
- reuse the "most expensive" part of the compiler — since optimisers take the most work to build
- IR extensibility — so that we can support new devices, new features (like tensors)
- types of IRs
- low level IRs good for binary translators, instrumentors — like translating from one variant of x86 to another — close to machine
- multilevel IRs: we translate from one IR to another IR. might facilitate better optimisations
- MLIR: extends on LLVM IR, for ML models in a graph format
- for example, MLIR goes from graph, to lower level, to lower level IRs
- lowering: generally more common than raising levels
- generally: level can determine types of optimisations
- IR example
- low level: `r2 = r1 * 40` assumes row-major order and skips to row `r1` from `fp`
- floats are 4 bytes, 10 columns → 4 × 10 == 40 (worked out below)
- note: bc of this, we have to know machine specific details: frame pointer
- also language specific: row major order
- medium level — workable because we don't need the low level's excessive detail to optimise
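Worked out, the address arithmetic behind that example (assuming, as on the slide, a 10-column array of 4-byte floats):

```cpp
#include <cstddef>

// Row-major offset of A[i][j] for a float A[ROWS][10]:
// offset = (i * 10 + j) * sizeof(float) = i * 40 + j * 4,
// which is where the low-level IR's  r2 = r1 * 40  comes from.
std::size_t rowMajorOffset(std::size_t i, std::size_t j) {
    const std::size_t cols = 10;            // columns per row
    return (i * cols + j) * sizeof(float);  // sizeof(float) == 4
}
```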
- lecture IR
- virtual registers: assume we have a machine with an infinite number of registers
4 Nov 2024
- slide 13
- devoid of machine details — operations are simple load/stores/indexing, we use infinite registers
- offset by 4 bytes — implicit in this example but LLVM needs to be explicitly told
- slide 15
- LLVM optimiser — sequence of passes (analysis, transformation)
- clang — front-end for many languages, incl. Java, Fortran, CUDA on top of C/C++
- slide 16
- bitcode file `.bc` — similar to an object file. binary representation of the in-memory data structure. we can build the in-memory structure from a bitcode file, incl. in between optimisation passes
- interpreter `lli` for the bitcode file
- text file `.ll` — very useful for iterative development
- but we can do the optimisation by hand in the text file, feed it back to LLVM, and get the executable out. benchmark and test performance. if promising, then commit to the 6-12 months
- bitcode file
- slide 17
- clang is both a front end and driver — driving: getting a C file invokes lexical analysis, parsing, AST gen, IR gen, etc
- we can also pass in a bc/ll file
- we define operations/instructions
- slide 19
- opaque ptr: we have to follow the pointer and query it to find what type it is
- slide 20
- `common` — primarily for linking purposes (externs and such). ignorable for now
- slide 24
- named registers: begin with %, then alpha name
- unnamed: begin with %, then numeric id
- used as temp registers
- registers have a scope
7 Nov 2024
- IR
- `common` — related to how it's linked. this is allocated within this file
- variables could be `extern`, so they might be allocated elsewhere
- `alloca` — local variable on the stack — can be read and written from/to
- `@` — global variable, i.e., `@main`, `@malloc`
- `%` — registers or local variables (if defined with `alloca`) — take a value and cannot be changed
- `#0` — attributes, decorators post optimisation. "breadcrumbs to show what you've done"
- `local_unnamed_addr` — there used to be a pointer here, but it's not particularly significant — these are generated automatically
- `nuw` — no unsigned wrap — if overflow/underflow, the result is invalid
- barring `goto`s, labels for control flow are generated automatically
- must always `ret` from a function
- load/stores
- now `ptr` instead of `<type>*` in new LLVM
- mostly similar to Assembly programming — just with syntactical changes
- aggregate types: array, struct, or class
- new notation: `getelementptr ptr %a, i64 1`
- `phi` instruction: consequence of SSA enforcement
- just selects one of the earlier values depending on which edge we arrived on; it doesn't conditionally evaluate (see the sketch below)
- a fictitious instruction that does this — not really a real instruction
- misc instructions used to do explicit type conversions
- `SetInsertPoint()` inserts at a given point, can insert before existing blocks
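A small hand-written LLVM IR fragment tying several of these together (illustrative only, not from the slides):

```llvm
define i32 @pick(i1 %cond) {
entry:
  %slot = alloca i32              ; local variable on the stack
  store i32 7, ptr %slot          ; note: ptr, not i32*
  br i1 %cond, label %then, label %else
then:
  br label %merge
else:
  br label %merge
merge:                            ; labels generated for control flow
  %x = phi i32 [ 0, %then ], [ 1, %else ]  ; phi: value depends on incoming edge
  %v = load i32, ptr %slot
  %sum = add nuw i32 %x, %v       ; nuw: no unsigned wrap
  ret i32 %sum                    ; must always ret from a function
}
```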
- optimisations
- algebraic simplifications — where we already know the result
- CSE — common subexpression elimination
- removes repeatedly computed values
- need to prove that repeated computation is guaranteed to be the same if we want to remove, i.e., it doesn’t change between computations
- copy propagation — no need to use the copy, just use the original
- DCE — dead code elimination
- remove things that are computed but are never used
- IVR — induction variable recognition, IVE — induction variable elimination
- we don’t need the loop iterators (i.e., what controls the loop). we can work directly with the addr
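A before/after sketch of the CSE, copy propagation, and DCE bullets above, in plain C++ (my example; a real optimiser does this on the IR, not the source):

```cpp
// Before:
int f(int a, int b, int c) {
    int t1 = (a * b) + c;
    int t2 = (a * b) + 42;  // a * b recomputed: a common subexpression
    int t3 = t2;            // a copy
    int dead = a + 99;      // computed but never used
    return t1 + t3;
}

// After CSE, copy propagation, and DCE. CSE is valid only because a and b
// provably don't change between the two uses:
int f_opt(int a, int b, int c) {
    int ab = a * b;         // CSE: compute once
    int t1 = ab + c;
    int t2 = ab + 42;
    return t1 + t2;         // copy propagation replaced t3; dead code removed
}
```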
- control flow analysis
- intra-procedural — not the focus of this course
- basic blocks
- `goto S5` — means we have to split BB1 and BB2, because we could start in the middle of a BB
— means we have to split BB1 and BB2, because we could start in the middle of a BB- at conditional branches, we must branch at the end, not anywhere in the middle
- control flow graphs
- multiple exit nodes == multiple returns
- multiple entry nodes? not in C/C++/relevant langs. just a historical curiosity (Fortran)
- branch nodes — denotes a conditional branch in the code
11 Nov 2024
- control flow graphs
- entry dominates every other node
- a node dominates itself
- proper dominance — basically removing the equality (a node doesn't properly dominate itself)
- dominator — a BB d is a dominator of BB b if every path from entry to b contains d
- a split in the paths? then blocks on only one branch are not included in the set of dominators
- set of dominators of a node is unique
- suppose by contradiction that for a node we have two distinct sets of dominators, D1 and D2, such that D1 has some node that doesn't exist in D2, and vice versa
- then that implies a contradiction: a node in D1 but not in D2 would both be a dominator of the node and be missing from a supposedly complete set of its dominators
- DOM_d is a singleton set, i.e., it can only have one element in the set
- algorithm for finding dominators
- assertion: if d is a dominator of n (d ≠ n), then it is a dominator of all of n's predecessors (this gives the iterative formulation below)
- faster with a DFS ordering: we can complete in a single iteration and be done
- the set of post-dominators in the forward graph is the set of dominators in the reverse graph
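The standard iterative formulation behind this (my notation; pred(n) is the set of predecessor blocks):

```latex
\mathrm{DOM}(\mathit{entry}) = \{\mathit{entry}\}, \qquad
\mathrm{DOM}(n) = \{n\} \,\cup \bigcap_{p \in \mathrm{pred}(n)} \mathrm{DOM}(p)
```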
- loop detection
- the loop's single entry point (the head) dominates all other nodes in the loop; an entry point in the middle of the loop would break this
- the tail is a descendant of the head, and an edge back to an ancestor is a back edge
14 Nov 2024
- natural loops — single entry, unique tail
- finding back edges — statements presented without proof
- for possibly reducible graphs — we must check if the head dominates the tail
- algorithm for finding loops
- pop a node, insert its predecessors onto the stack (see the sketch below)
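A sketch of that worklist algorithm for the natural loop of a back edge tail → head (my code; the `BasicBlock` shape is assumed):

```cpp
#include <set>
#include <vector>

struct BasicBlock { std::vector<BasicBlock*> preds; };

// Natural loop of back edge tail -> head: head, tail, and every node that
// can reach tail without passing through head.
std::set<BasicBlock*> naturalLoop(BasicBlock *head, BasicBlock *tail) {
    std::set<BasicBlock*> loop = {head, tail};
    std::vector<BasicBlock*> work;
    if (tail != head) work.push_back(tail);  // self-loop: nothing to explore
    while (!work.empty()) {
        BasicBlock *b = work.back();         // pop node
        work.pop_back();
        for (BasicBlock *p : b->preds)       // insert predecessors
            if (loop.insert(p).second)       // skip nodes already in the loop
                work.push_back(p);
    }
    return loop;
}
```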
- slide 98
- core idea: given a `d = 8`, can we show that `d` won't be changed by the next time it's used (read)? if there's no change, we can optimise `e = d + 2` into `e = 8 + 2` and then `e = 10`
- if there’s any change in one of the branches, we conclude we cannot optimise
- core idea: given a
- local reaching definitions — within a BB. at the beginning, we’ve made no definitions
- `RDin(s)` — set of all RDs from the previous blocks that survive to the beginning of this BB
- `Gen(s)` — generate set, set of definitions that we create
- `Kill(s)` — kill set, set of definitions killed
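In equation form (standard dataflow equations matching these definitions):

```latex
\mathrm{RDin}(b) = \bigcup_{p \in \mathrm{pred}(b)} \mathrm{RDout}(p), \qquad
\mathrm{RDout}(b) = \mathrm{Gen}(b) \cup \big(\mathrm{RDin}(b) \setminus \mathrm{Kill}(b)\big)
```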
- suppose we have a back edge during RD data flow
- we need `RDout(1)` and `RDout(5)` to calculate `RDin(2)`. we don't know `RDout(5)` yet. we assume for now that `RDout(5)` is ∅. then, when we come to it, we update and repeat the analysis until no changes happen
- fixed point iteration algorithm — useful for fixing cyclic dependencies
- reaching definitions solver
- initialise with ∅ for all `RDout(b)`
- keep iterating until we've reached a "fixed point", when no more changes are possible (sketched in code below)
- initialising with `Gen(b)` is better than ∅ since it means we start one iteration ahead
- since the set of definitions is always finite, we’ll terminate
- we only make set unions. at worst, we add until we add everything, but since everything is finite we will terminate
- monotonic algorithm
- correctness? assume correctness!
- how many iterations? no more than the number of back edges + 1
- forward any-path problem — forward because we're asking a question about what happened earlier in the CFG; any-path because the property only has to hold along some path, hence the set union
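A compact sketch of that solver (my code; bitsets stand in for the definition sets and the `Block` shape is assumed):

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxDefs = 128;  // assumed upper bound on definitions
using DefSet = std::bitset<kMaxDefs>;

struct Block {
    std::vector<int> preds;  // indices of predecessor blocks
    DefSet gen, kill;        // Gen(b), Kill(b) from the local analysis
    DefSet in, out;          // RDin(b), RDout(b)
};

// Fixed-point iteration: the sets only ever grow (monotonic) and the universe
// of definitions is finite, so the loop terminates.
void solveReachingDefs(std::vector<Block> &cfg) {
    for (auto &b : cfg) b.out = b.gen;  // init to Gen(b): one iteration ahead
    bool changed = true;
    while (changed) {                   // iterate until a fixed point
        changed = false;
        for (auto &b : cfg) {
            DefSet in;                  // RDin(b) = union of preds' RDout
            for (int p : b.preds) in |= cfg[p].out;
            DefSet out = b.gen | (in & ~b.kill);  // Gen + (RDin - Kill)
            if (in != b.in || out != b.out) {
                b.in = in;
                b.out = out;
                changed = true;
            }
        }
    }
}
```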
- live variable analysis
- used for register allocation — very easy to understand why — if dead, we don’t need to keep in register and can throw it out
- ground truth — at the end of the BB, nothing is live by definition
- `b = a + b` — b is live then immediately killed, since the read occurs before the write ← check
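The corresponding backward equations (standard form; the names `Use(b)` for variables read before any write and `Def(b)` for variables written are assumed, not from the slides). In `b = a + b`, b lands in Use(b), matching the read-before-write point above:

```latex
\mathrm{LiveOut}(b) = \bigcup_{s \in \mathrm{succ}(b)} \mathrm{LiveIn}(s), \qquad
\mathrm{LiveIn}(b) = \mathrm{Use}(b) \cup \big(\mathrm{LiveOut}(b) \setminus \mathrm{Def}(b)\big)
```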