This is a partially reconstructed backup as the wiki of Prof. Kirsch’s website, on which our project page was hosted, went offline.
In the course “Compiler Construction” of Prof. Cristoph Kirsch we are challenged to implement a compiler. The outcome is ought to be a self-compiling compiler which implies (somehow) to use a subset of an already existing programming language. In our case, we are going to implement a C-compiler (sure, a meaningful subset of C) and a Virtual Machine on which our executables are executed.
- Stefan Huber (shuber2
gmx . at, 0320662)
- Christian Rathgeb (c . rathgeb
sbg . at, 0320943)
- Stefan Walkner (swalkner
cosy .sbg .ac . at, 0320190)
Basic Compiler Properties
The scanner is a one-to-one mapping of the deterministic finite automaton to C source code. Internally the scanner is represented by C struct which contains a pointer to the pre-processor. For running the automaton the next character and the current state is saved as well. For more details you may want to read the documentation.
The parser is a recursive-descent parser for our language which is an element of LL(2). Due to the fact, that implementing a scanner and parser is straight forward, the most interesting point is the language itself. The language is listed in E-BNF in the appendix of the documentation. In chapter 4.1 you can find a discussion on the language design.
In our implementation the parser is also the instance which calls the code-generation and the symbol-table in contrast to a “natural” batch design.
The output language that is generated by our compiler is GNU Assembler (GAS). So GAS is our ‘object file format’. This gives us two choices to proceed:
- assemble the generated code with GCC and receive i.e. ELF binaries (or any other type of binaries - depending on the host system) or
- use the hrwcc linker to produce ‘executable’ code for the VM.
To produce executable code out of the object code we also developed a linker. The linkers main task is to replace all ‘markers’ in the generated code file(s) by their corresponding addresses in the object file code. It also handles ‘.globl’ statements (to export functions to other files) and prepend some kickoff code to call the ‘main’ function. The file that is produced by the linker is our ‘executable file format’. It already contains the calculated addresses for the .text and .data sections and does not have any marker left except some special functions (like printf, open, close, etc) that will be wrapped by the VM.
For a example (showing most of the features that hrwcc supports) please download features.c and see what happens to a c source file when it is thrown into hrwcc :D
We decided to simulate a 2GB address space as follows: we group the 2GB into 2048 slots (byte *memory) each carrying 1024kB. So functions like getb or setb which offer to access one byte in memory first have to find the right slot and the block. This concept is quite similar to Paging in Operating Systems. Further more handling integers we take care of so-called slot-boundaries not to cause unused space or corrupt data where there shouldn’t be. We also implemented the functions malloc and free which acquire to store an array of markers to assign used and unused space in memory. The stack is integrated in this memory and take up 2MB (=2 slots) at the start of each execution.
We choose to execute our final good on a virtual machine which we implemented in C++ to make use of object orientation. The required input of our VM is a nearly pure assembler file (see above) just leaving some special functions (e.g.: printf, open, read, write) to wrap (the equivalent libc-functions are called in the VM). The VM, which parses the input file, stores the data section in the virtual memory and creates instruction objects, consisting of an opcode and two operands if there are any (e.g.: ret, jmp, addl), for each line and executes every single one by operating on registers and virtual memory.
As you can see in the milestone of Code generation, the first (fully functional) release of hrwcc is finished and completely self compilable…
Due to good experiences in our operating system project OS-Winter-2006.HrwOS, we split the project up to several milestones. Although its hard to foresee the major disciplines in this project, we also tried to assign timestamps to them. But there is less time and so even a vague conception is important for us.
2007-03-08: Brainstorming [OK] Constitutional decisions (C, x86, ELF), basic concepts Milestones definition
2007-03-13: EBNF [OK]
2007-03-19: Brainstorming VM [OK]
Decide: VM vs. ELF (Object File Format & Linking)
2007-03-26: Preprocessor [OK]
Handle Comments, Directives (include, define, ifdef, ifndef, else, endif), Substitutions (Macros and Defines)
2007-03-19: Scanner [OK]
Build deterministic finite automaton Cast it to code
2007-04-09: Parser [OK]
Building data structure for syntax tree
Recursive descent parsing – our language is in LL(2)
Doing some error recovery
2007-04-12: Symbol-Table [OK]
Defining, Declaring functions, variables, structs and strings
Calculating sizes and proper error handling
First ‘compilation’ (parsing + sym.tab.filling) of symbol table source code itself
2007-04-28: Code generation [OK]
Generating gnu-assembler patterns for control structures like ‘if’, ‘while’, ‘break’, ‘return’, …
Emitting code for every EBNF-production
Compile euklid.c which generates the gcd of two integers and assemble it with gcc – works!
Check for types in assignment and function call
You may want to see euklid-ext.c.txt and euklid-ext.s.txt which contains a C-file and the produced asm-file.
Wrote fuzz-testing tool which is HrwCC compilable.
Currently self-compilable: preproc (18400 loc asm), scanner (6700), parser (14100), symbol-table (19400), codegen (27700), linker (10800). When linked with hrwcc it results in ~100.000 loc.
gcc compiles hrwcc, which compiles hrwcc, which compiles hrwcc. md5-sum of last two is identical. Size of “auto-compiled” hrwcc is about a double, speed a half of gcc compiled versions.
2007-04-29: Linker [OK]
Parse labels and assign addresses, parse .globl statements and resolve their addresses in the executable
Collect .text and .data section from multiple files
Add debugging symbols (leaves markers and add their addresses, leaves comments)
Kickoff code to call “main”
2007-07-12: Virtual Machine [OK]
Finished base components/functionality and started instruction parsing
Programming Memory abstraction (malloc/free/realloc etc)
Implementing wrapper functions (open, read, sprintf, atoi, isalnum, etc.)
hrwvm can execute hrwcc to compile features.c
Additional: Code optimization [OK]
Implemented efficient assignments and variable access
Implemented optimization at asm-level for several patterns of asm-code
Currently, the non-optimized output consists of 103906 loc modified asm-code. The optimized one is reduced to 81793 loc (22%)
**2007-06-26: Documentation and presentation [OK] **
The source code contains a ‘bootstrap.sh’ in the root directory which you may want to execute for self-compiling hrwcc three times and comparing the MD5 sums. The root directory also contains a Makefile and Makefile.hrwcc. The first one compiles hrwcc with g++ and Makefile.hrwcc compiles hrwcc with hrwcc. Have fun!
For a demonstration of the VM you might call the ‘hrwvm-demo.sh’ which compiles the features.c file once with the ELF-hrwcc binary and once with the hrwvm-executed compiler. After that the MD5 sum of both outputs is printed and the second one is executed by hrwvm. If you have much time, you can start the self-compilation process using hrwvm to execute hrwcc. You just have to call ‘make -f Makefile.hrwvm’ in the hrwcc directory. This will take you 1.5-2 days – you have been warned…