Compiler for C0, a subset of C developed at CMU
In Fall 2019, as a semester-long project for the compilers course at CMU, my teammate Larry and I implemented from scratch a compiler for C0, a safe subset of C developed at CMU. We started with straight-line code, then incrementally added support for conditionals, functions, arrays/pointers/structs, optimizations, and try-catch exception handling. Our compiler generated the fastest code in the course.
In Fall 2020, when I TA'ed the compilers course, I upgraded our compiler by adding and refining multiple analysis and optimization passes, with the goal of making the emitted code as fast as possible. Our compiler is currently used as the reference compiler for CMU's compilers course.
Course and Teammates
Sep-Dec 2019, Oct-Nov 2020
Fastest known C0 compiler at CMU. Emitted code runs 2 times faster than gcc -O1 on average, and 5 times faster than CMU's original C0 reference compiler. Code size is 10% smaller than gcc -O1 on average.
C0 is a safe subset of C, in the sense that it throws exceptions on unsafe operations, such as out-of-bounds array accesses or null-pointer dereferences. Our compiler takes in C0 program files and translates them to linkable x86-64 assembly code. A list of the C0 language syntax we supported is found here. Our compiler consists of multiple passes, as illustrated below:
[Pipeline diagram: C0 source → (c0 lexer, c0 parser) → Concrete Syntax Tree → Abstract Syntax Tree → IR Tree → Quads → Quads in SSA form → Abstract Assembly → (liveness analysis, register allocation) → x86-64 assembly code]
Keeping these passes independent made it easy for us to overhaul our implementation of any single pass, and also simplified debugging, since we could stop after any pass and print out the intermediate result. Below I will briefly describe the passes of our compiler, then discuss the major optimizations we made.
Top Level Environment and Emitting Graphviz
The top level environment controls the execution of all phases of the compiler. It also manages compiler flags, printing facilities, thresholds that determine whether certain optimizations are performed, and the global context that passes information between optimization passes. It also provides the functionality to generate .gv Graphviz files in the DOT language, which can draw cool graphs of the program in the Quads IR, Quads SSA IR, and Abstract Assembly IR, as well as the CFG and Dominator Tree.
Concrete and Abstract Syntax Tree
We first parse the source code into a CST (concrete syntax tree), then elaborate it into an AST (abstract syntax tree). Elaboration mainly serves to simplify the IR: for example, for loops are elaborated into while loops, and arrow operators into a dereference followed by a dot operator. The AST still maintains the high-level constructs of the L4 language.
We perform typechecking on the AST. The typechecker outputs a Typed Syntax Tree (TST), which is the same as an AST except that all expressions are annotated with their types. We then propagate the type information to all later IRs (IR Tree, Quads, Abstract Assembly), which is useful for alias analysis and debugging.
The IR tree data structure consists of assembly-like statements and expression trees. Expression trees are pure in that all operations in an expression tree have no side effects. The purpose of the IR Trees is to make the dynamic semantics of the program explicit by promoting all operations which could have an effect to the top level. Consequently, null checks and array index checks are also inserted during IR Tree translation.
IR trees are created using a maximal-munch algorithm on ASTs. The translation transforms high-level control-flow constructs (such as if and while statements) into low-level assembly constructs (such as Labels and Jumps). Expressions are largely left as is, except for effectful operations, which as described above are distilled from the expression tree. Variables in the AST are translated into temps, with new temps being created for intermediate results as needed.
Quads (quadruples of (target, opcode, operand1, operand2)) consists entirely of assembly-like instructions. Quads is three-address code that includes only temps and constants, with no registers or stack locations, which makes it a backend-independent IR. IR Tree expressions are translated into quads using a maximal-munch algorithm, while IR Tree statements are translated directly, as they already consist of assembly-like instructions.
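To make the shape of this IR concrete, here is a tiny sketch of maximal-munch translation of an expression tree into quads. The names and tuple encodings are illustrative, not our actual implementation:

```python
# Sketch: lowering an expression tree into quads (three-address code),
# creating a fresh temp for each intermediate result. Illustrative only.

import itertools

_fresh = itertools.count()

def new_temp():
    return f"t{next(_fresh)}"

def munch(expr, quads):
    """Translate an expression tree into quads, returning the temp
    (or constant/variable) that holds its value."""
    if isinstance(expr, (int, str)):      # constant or named temp
        return expr
    op, left, right = expr                # e.g. ("+", e1, e2)
    l = munch(left, quads)
    r = munch(right, quads)
    dst = new_temp()
    quads.append((dst, op, l, r))         # (target, opcode, op1, op2)
    return dst

quads = []
result = munch(("+", ("*", "b", "c"), "d"), quads)
# quads now holds: ("t0", "*", "b", "c") then ("t1", "+", "t0", "d")
```

Note how the nested multiplication is "munched" first, so each quad stays a flat assembly-like instruction over temps and constants.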
Most optimizations in our compiler are performed either directly on Quads or after transforming Quads into SSA form.
Control flow graph (CFG)
Our CFG is a functor that can be built from either Quads or Abstract Assembly. The CFG consists of basic blocks (each of which contains a label, a body of instructions, and an exit instruction) and a directed graph representing the edges between basic blocks. The CFG also has a unique entry block and exit block. Currently, we build one CFG per function in the program.
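A rough sketch of this data structure, in Python with illustrative names (our compiler's actual types differ):

```python
# Sketch of the CFG described above: basic blocks keyed by label,
# plus successor/predecessor edge maps and designated entry/exit.

from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    label: str
    body: list        # assembly-like instructions
    exit: tuple       # e.g. ("jmp", "L2") or ("cjmp", cond, "L2", "L3")

@dataclass
class CFG:
    blocks: dict = field(default_factory=dict)   # label -> BasicBlock
    succs: dict = field(default_factory=dict)    # label -> set of labels
    preds: dict = field(default_factory=dict)
    entry: str = ""
    exit: str = ""

    def add_block(self, block):
        self.blocks[block.label] = block
        self.succs.setdefault(block.label, set())
        self.preds.setdefault(block.label, set())

    def add_edge(self, src, dst):
        self.succs[src].add(dst)
        self.preds[dst].add(src)
```

Keeping the edge maps separate from the block bodies makes analyses like liveness and dominance easy to express as graph traversals.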
The CFG is a key data structure on which almost all intraprocedural optimizations are performed. We also construct a Dominator Tree and a Loop Analysis Framework on top of the CFG. The Dominator Tree records immediate dominators, dominator children, and dominance frontiers. The Loop Analysis Framework records information about each loop in a function (all blocks in the loop, the header block, exit block, parent loops, etc.).
Static Single Assignment (SSA)
We construct the SSA form of Quads with the help of dominance frontiers. We then perform the majority of our optimizations, some intraprocedural and some interprocedural, on the SSA form, carefully maintaining the correctness of phi functions throughout. Finally, we deconstruct SSA back into Quads form, using CSSA deconstruction on some blocks where necessary (some optimizations can create the lost copy problem).
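The phi-insertion step can be sketched as the standard iterated-dominance-frontier computation. This is a simplified illustration under assumed inputs, not our exact code:

```python
# Sketch: for each variable, phi functions are placed at the iterated
# dominance frontier of its definition sites. A newly placed phi is
# itself a definition, so its block is pushed back onto the worklist.

def place_phis(defs, df):
    """defs: var -> set of block labels defining it;
    df: block label -> its dominance frontier (set of labels)."""
    phis = {v: set() for v in defs}
    for v, sites in defs.items():
        work = list(sites)
        while work:
            b = work.pop()
            for f in df[b]:
                if f not in phis[v]:
                    phis[v].add(f)
                    work.append(f)   # phi at f is a new definition of v
    return phis
```

For a diamond CFG where a variable is assigned in both branches, this places a single phi at the join block, as expected.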
Abstract assembly consists entirely of assembly-like instructions. Like Quads, it is three-address code; unlike Quads, it contains registers and stack locations. Abstract Assembly is also x86-64-specific, since it contains vector registers and conditional moves. At this level we handle calling conventions, perform register allocation, and apply some x86-64-specific optimizations.
We do register allocation on three-address Abstract Assembly (two-address code would lose some coalescing edges for binary operations). We first perform dataflow-based liveness analysis, with special preprocessing for divisions and modulos due to x86 conventions. We then run a graph-based register allocation algorithm: maximum cardinality search (MCS) followed by greedy coloring on the resulting simplicial elimination ordering (SEO). Additionally, before greedy coloring, we perform prespilling (moving temps that appear in a large number of maximum cliques to the end of the SEO so that they are the ones spilled) and best-effort coalescing (coalescing nodes to the same color when doing so introduces no spills).
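The MCS-plus-greedy-coloring core can be sketched as follows. This is a simplified illustration without prespilling or coalescing; the interference graph is just an adjacency map:

```python
# Sketch: maximum cardinality search produces an ordering (a simplicial
# elimination ordering on chordal graphs), then greedy coloring in that
# order assigns each temp the lowest color unused by colored neighbors.
# On chordal interference graphs this coloring is optimal.

def mcs(graph):
    weights = {v: 0 for v in graph}   # weight = # of already-visited neighbors
    order = []
    while weights:
        v = max(weights, key=lambda u: weights[u])
        order.append(v)
        del weights[v]
        for u in graph[v]:
            if u in weights:
                weights[u] += 1
    return order

def greedy_color(graph, order):
    color = {}
    for v in order:
        used = {color[u] for u in graph[v] if u in color}
        c = 0
        while c in used:
            c += 1                    # lowest color not used by neighbors
        color[v] = c
    return color
```

Colors beyond the number of physical registers correspond to spill slots, which is what prespilling tries to steer toward rarely used temps.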
x86 code generation
x86-64 assembly is two address code directly representing the emitted x86-64 code of our compiler. Abstract Assembly is translated into x86-64 Assembly using the results of register allocation. We perform instruction selection and deal with various x86 conventions and ISA constraints. Additionally, we perform various peephole optimizations and cleaning of unnecessary jumps and loads. We also perform code alignment optimization, which aligns functions and top level loop headers to 16 bytes to help keep instructions within the L1 instruction cache.
We implemented numerous analysis and optimization passes in our compiler. Many of our optimizations are based on algorithms from papers, though quite a few were our own innovations. The diagram below shows the enhanced passes of our compiler backend after adding in optimizations. I will write more about these optimizations in a tech report.
[Diagram: backend passes after adding optimizations, grouped by IR level]

On Quads:
1. Purity analysis
2. Partial redundancy elimination
3. Strength reductions
4. Function inlining
5. Tailcall optimization
6. Cleaning control flow
7. Alias analysis

On Quads in SSA form (with Φ functions):
1. SCCP and copy propagation
2. Loop unrolling
3. Null check and bounds check elimination
4. Loop hoisting
5. Aggressive deadcode elimination
6. Code scheduling
7. Value range analysis
8. Loop index and iteration detection
9. Loop strength reductions

On Abstract Assembly (liveness analysis, register allocation):
- Conditional moves
- Reducing loads

On x86-64 assembly code:
- Code alignment
- Pruning jumps
- x86-specific peephole optimizations
Sparse Conditional Constant Propagation (SCCP)
SCCP performs constant propagation on the SSA form to get rid of unnecessary instructions, while simultaneously trimming branches that will never be visited. For example, if we know after constant propagation that a conditional will evaluate to True, then its False branch can be trimmed. Our SCCP algorithm is adapted from section 3.4 of Wegman and Zadeck. The key step is to construct a table that maps each temp to one of three statuses: Anything, Constant, or Indeterminable. Anything means we know of no possible values for a temp, Constant means we know of exactly one possible value, and Indeterminable means we know of two or more possible values. Thus a status can only change one way: from Anything to Constant to Indeterminable. We initialize all normal temps to Anything, and all function arguments and memory temps to Indeterminable (so as to ignore memory temps during SCCP, given the difficulty of memory aliasing).
We maintain two worklists. One holds SSA edges, which go from the definition of a temp to its uses; the other holds CFG edges. An SSA edge of a temp is added to the worklist when the status of the temp changes, which happens at most twice. A CFG edge is added to the worklist when we encounter a conditional jump. Note that we ignore branches that we know for sure will never be visited. For Φ functions, we treat all arguments from unvisited blocks as Anything, and combine the arguments to determine the status of the Φ temp.
After finalizing the table, we can trim all instructions whose destination is Constant, since we already know the result of these instructions to be a constant. We use the table to overwrite temps in other instructions with their constant values. Then we perform a DFS from the first instruction to trim off the dead branches.
Since we traverse each SSA edge at most 2 times and each basic block at most once, the running time of SCCP is O(number of instructions), which is as good as we can possibly hope for.
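The status lattice at the heart of the algorithm can be sketched as follows. This is a simplified illustration of the one-way lattice and the combine step for Φ arguments; the real pass also drives the two worklists described above:

```python
# Sketch of SCCP's three-point lattice. Anything is the optimistic top,
# Indeterminable the bottom; joining statuses only moves down the
# lattice, so each temp's status changes at most twice.

ANYTHING = ("anything",)          # no known possible values yet
INDET = ("indeterminable",)       # two or more possible values

def const(c):
    return ("const", c)           # exactly one known value

def join(a, b):
    """Combine two statuses, e.g. when merging Phi arguments."""
    if a == ANYTHING:
        return b
    if b == ANYTHING:
        return a
    if a == b:
        return a                  # agreeing constants stay constant
    return INDET                  # disagreeing constants (or bottom)
```

Treating Φ arguments from unvisited blocks as Anything means they simply drop out of the join, which is what makes the analysis optimistic.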
Redundant Safety Check Elimination
C0 dictates that each pointer dereference must be preceded by a null check. We came up with a clever way of getting rid of redundant null checks on the same temp.
For an instruction dereferencing a temp t1, we need not perform a null check if either we know for sure that t1 is not null, or t1 has already been checked along all possible execution paths reaching this instruction. With this observation, we maintain a null status table that maps each temp to a status (not-null or indeterminable), and a null checks table that maps each basic block to its good temps: temps that have definitely been checked before or while visiting that block. We further observe that the good temps of a basic block are just the intersection of the good temps of the predecessors that dominate it. Thus we can propagate over basic blocks while maintaining these tables to filter out unnecessary null checks.
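A simplified sketch of this propagation, assuming an acyclic block ordering and intersecting over all predecessors rather than only dominating ones; the instruction shapes and names are illustrative:

```python
# Sketch: forward must-analysis for redundant null-check elimination.
# A temp is "good" at a point if it has been checked on every path
# reaching that point; a nullcheck on a good temp can be dropped.

def eliminate_null_checks(blocks, preds, order):
    """blocks: label -> list of instrs like ("nullcheck", t) / ("deref", t);
    preds: label -> set of predecessor labels;
    order: labels in topological order (acyclic CFG assumed here)."""
    out_good = {b: set() for b in blocks}
    for label in order:
        ps = preds[label]
        # good temps entering this block: checked on all incoming paths
        good = set.intersection(*(out_good[p] for p in ps)) if ps else set()
        kept = []
        for instr in blocks[label]:
            if instr[0] == "nullcheck":
                if instr[1] in good:
                    continue          # already checked on every path: drop
                good.add(instr[1])    # this check makes the temp good
            kept.append(instr)
        blocks[label] = kept
        out_good[label] = good
    return blocks
```

The real pass also consults the dominator tree and the per-temp null status table, but the intersection over incoming paths is the core idea.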
Improved Register Allocation
The first improvement we made to our register allocation algorithm is coalescing. For each coalescing edge between temps t1 and t2, we can coalesce them by coloring both with c, where c is the lowest available color not assigned to any neighbor of t1 or t2 (if every such c would result in a spill, we don't coalesce). We came up with the idea of using the union-find data structure to support multiple rounds of coalescing efficiently, since coalescing two temps is equivalent to the merge operation in union find.
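A sketch of coalescing on top of union-find, with illustrative names and simplified bookkeeping compared to our actual pass:

```python
# Sketch: merging two temps is a union, looking up a temp's coalesced
# class is a find. Coalescing succeeds only if some register color is
# free for the merged class (i.e. it would not introduce a spill).

parent = {}

def find(x):
    parent.setdefault(x, x)
    if parent[x] != x:
        parent[x] = find(parent[x])   # path compression
    return parent[x]

def try_coalesce(t1, t2, neighbors, color, num_regs):
    """Coalesce t1 and t2 into one class if a spill-free color exists.
    neighbors/color are keyed by class representatives."""
    r1, r2 = find(t1), find(t2)
    used = {color[find(n)] for n in neighbors[r1] | neighbors[r2]}
    for c in range(num_regs):         # only real registers, never spills
        if c not in used:
            parent[r2] = r1           # union: merge the two classes
            neighbors[r1] |= neighbors[r2]
            color[r1] = c             # lowest safe color for both
            return True
    return False                      # coalescing would spill: skip
```

Because merged classes share one representative, later rounds of coalescing see the already-merged neighbors automatically, which is what makes multiple rounds cheap.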
The second improvement was adding a score function to break ties in MCS. In naive MCS, when there are multiple vertices with the maximum weight, we are free to choose any of them to visit next. Our scoring heuristic prioritizes visiting temps with many uses (a use within a nested loop counts as many uses) and with few neighbors (so there is less register pressure), since we want to avoid spilling heavily used temps.
In order to calculate the nesting depth of each instruction, we used the dominator tree together with the observation that a basic block is a loop header if and only if it dominates at least one of its predecessors. The edges from those predecessors back to the header are exactly the loop's back edges. To get the nesting depth of a block B, we simply count the loop headers that dominate B and whose loop actually contains B, i.e., each of which has at least one loop-body block reachable from B via only non-back edges.
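A sketch of this observation, given precomputed dominator sets. It is simplified: it counts all dominating loop headers and omits the non-back-edge reachability check, so it can overcount for blocks that sit after a loop the header dominates:

```python
# Sketch: dom[b] is assumed to be the set of blocks dominating b
# (including b itself); preds[b] is the set of b's CFG predecessors.

def loop_headers(preds, dom):
    """A block is a loop header iff it dominates one of its predecessors;
    the edge from that predecessor is a back edge of the loop."""
    return {b for b in preds if any(b in dom[p] for p in preds[b])}

def nesting_depth(b, preds, dom):
    """Approximate nesting depth: number of loop headers dominating b."""
    return sum(1 for h in loop_headers(preds, dom) if h in dom[b])
```

On a simple CFG entry → header → body → header → exit, this correctly identifies the header and gives the body a depth of 1.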
Optimizing a compiler is challenging. There are many edge cases to take care of, and many tradeoffs to consider, especially in passes such as inlining, register allocation, and loop unrolling. However, I love solving challenging and neat problems, and compiler optimization is exactly that. It was extremely rewarding to see the emitted code of my compiler become cleaner and more elegant after every small improvement. By working on this project, I not only gained more insight into compilers, but also into computer architecture, language design, and IR design.