Writing a Kernel From Scratch
x86 assembly code
Our 3rd project (p3) for CMU's operating systems course is to implement a microkernel from scratch using C and x86 assembly. We had to design and implement all components on our own, including virtual memory, locking and synchronization, context switching and scheduling, exception handling, device drivers, and support for various syscalls.
For our 4th project, we turned our microkernel from p3 into a hypervisor capable of hosting slightly-modified p3 kernels, using paravirtualization. The main challenges were managing virtual memory mapping for guest kernels and correct delivery of virtual interrupts/exceptions/traps.
Due to CMU's academic integrity policy, I will not discuss any code or design of our project. I will instead focus on describing the scope and major difficulties of the project.
Course and Teammates
Syscalls supported (link)
Project 3: Microkernel
Mar - Apr 2019
Project 4: Paravirtualization
Life cycle: fork, thread_fork, exec, set_status, vanish, wait, task_vanish
Thread: gettid, yield, deschedule, make_runnable, get_ticks, sleep, swexn
Memory: new_pages, remove_pages
Console I/O: get_char, readline, print, set_term_color, set_cursor_pos, ...
Miscellaneous: readfile, halt
Hypervisor: enable_interrupts, setidt, setpd, adjustpg, iret, print, exit, ...
Figure 1: Map showing the various components of our OS
OS has the reputation of being one of the "hardest" CS courses at CMU, and this microkernel project lives up to its reputation. We implement a kernel from scratch, given only a boot loader and a simple C library. Some components we needed to implement included virtual memory, locking mechanisms, context switcher, scheduler, exception handler, console driver, and support for various syscalls. Moreover, the course staff only teaches us overarching principles of OS design, and red flags to definitely avoid. The great majority of design decisions and implementation choices are left to us, so many designs that could pass the test cases might not be the best way, and we found out only after submitting our assignment and receiving feedback. Nevertheless, this freedom led us to think and reflect really hard on the pros and cons of each aspect of system design, and taught us "good taste" in writing systems code.
Our kernel is run on Simics, a system simulation/debugging platform similar to gdb. Kernel debugging is hard because of its nondeterministic nature: every invocation produces a different result and some bugs are hard to replicate. Often, we had to carefully set breakpoints, step through assembly line by line, and investigate physical memory content and special register values byte by byte. For Project 4: Paravirtualization, debugging is even harder because guest kernels are essentially a blackbox, so we needed to verify all details of environment set-up when passing control to guest kernel.
Below I will discuss the major requirements and difficulty of this project and what we learned through implementing it.
1. Preemptive kernel
Our kernel must be as preemptible as possible, meaning that no matter what code sequence is running, whether in kernel mode or user mode, when an interrupt arrives it can be handled very promptly and can cause a prompt context switch if appropriate. We must be able to support context switching every 1 or 2ms.
This means we must use the disable_interrupts macro sparingly, and write thread-safe code. This is extremely challenging given the large number of system calls we must support and the complexity of a kernel. Numerous race conditions are possible and must be considered. A good locking and synchronization library is crucial. We implemented a mutex, a conditional variable, and a readers-writer lock, each of which were used all throughout our kernel. These were very hard to get right, because writing a conditional variable for the kernel is quite different from writing one in the user space. We had to rehaul and refine our implementation of these multiple times throughout the project.
2. When is failure allowed?
A lot of system calls are NOT allowed to fail, meaning that we cannot arbitrarily kill a thread. For example, wait is a system call that must succeed, so we are not allowed to invoke the memory allocator in the implementation of wait, in case we run out of system memory and the memory allocator fails. Instead, we must plan ahead and allocate enough system memory for a thread/process when fork/thread_fork/exec is called, since fork/thread_fork/exec is allowed to fail. If fork/thread_fork/exec fails because there is no
3. x86 Architecture
Our kernel is built on the x86 architecture, so we had to read the Intel x86 documentation closely. Some conventions we encountered in our implementation included segment selector registers and GDT (global descriptor table), %cr3 and page directory entry flags and page table entry flags, EFLAG and control registers, IDT (interrupt descriptor table) and interrupt handlers.
4. Virtual memory (VM)
We wrote a frame allocator to manage physical frames, and a virtual memory manager on top of it that manages page directories and page tables. We used lazy (on-demand) allocation similar to Linux, deferring the allocation of physical pages until the virtual addresses are written to. Before that, all allocated but unwritten virtual memory point to the same fixed zero physical frame, and are caught by the page fault handler when written to. Thus we avoid allocating physical frames for mapped memory that will never be used. In our VM manager, some important helpers included ones that maps/unmaps all virtual memory in a region, and one for fork that copies part of a page directory mapping to a new page directory. We also had a robust library that handles validating/checking virtual memory addresses.
5. Context switching and mode switching
We maintain a TCB (thread control block) for each thread and a PCB (process control block) for each process on the kernel stack. Our context switcher switches between threads by changing %esp to switch between TCBs. Mode switching between kernel mode and user mode utilizes the iret call. One key design choice is scheduling. Even though we used a simple round-robin scheduler, there were many cases to consider. For example, we should only run threads that are runnable (so to be efficient our scheduling should only loop through runnable threads instead of suspended/sleeping/locked ones), but we must immediately mark a suspended/sleeping/locked thread runnable when it can once again do productive work. Moreover, threads that terminate sometimes have leftover memory that needs to be freed by the next thread or a dedicated cleaning thread, which requires carefully scheduling.
6. Exception handling
We installed trap gates and interrupt gates in the IDT to handle all 32 x86 exceptions. The key handlers for exceptions were the page fault handler and swexn. We adopted the zero-fill on demand convention and handed out new pages to user processes via the page fault handler. swexn was one of the more involved syscalls, and it goes through checking multiple edge cases of invalid user argument and setting up the exception stack to register/deregister a software exception handler.
7. Efficiency of kernel
One key principle in our kernel design is efficiency, both in runtime and memory. This principle guides many of our design choices. For example, busy spinning is absolutely forbidden, and we avoided holding process-wide locks during any strenuous, long-running task. Another example is, we often needed to update page tables or the page directory of a process, and it is inefficient to write to %cr3 each time we update, which will invalidate entries for all pages in the TLB not marked global. Instead, we use the INVLPG instruction to only invalidate necessary entries.
7. Implementing data structures
We had to implement our own data structures, such as a circular doubly linked list, a queue, and a linear probing hash table. The hash table was particularly important as it ensured O(1) average running time for some of our syscalls. The hash table needed to be resizable, both to become larger and smaller to adapt to changing number of threads, and we only resize it in calls that could fail, such as fork/exec/new_pages.
8. Device drivers
We were required to implement a timer driver, keyboard driver, and console driver. The readline syscall was especially tricky and involved many edge cases and potential races. We found that the architectures most appropriate here were a readers-write lock and circular buffer.
9. Writing assembly code
The majority of our compiler is written in C, but some sections that required delicate juggling of registers were written in x86 assembly. These included part of the fork and swexn system call, the mode switch handler, context switch handler, fault/interrupt/trap gates, syscall handlers that delegates to C functions, and other helpers such as xchg and panic.
10. Creative debugging methods
Occasionally we had to resort to unconventional debugging methods. For example, one livelock solution we often encountered was thread A holding a lock that thread B needs, while thread B holds a lock that thread A needs, and the two threads trying to yield to each other, thus stuck in a context switching loop. The problem is, we couldn't identify the identity of thread A and thread B and where they were stuck. We could try to print out information, but the livelock often happens quite deep into a test case, and printing out information at each context switch will take forever to finish the test case, and often fail to replicate the livelock. Therefore, one debugging method we came was to reserve one special byte in kernel memory, and check the bits of this byte at various points in the program to decide whether to print out debugging information. We will set all bits of this byte to 0 at the start, but once the program is stuck, we will use Simics to set this byte to 0xF, which will print out the debugging information we need.
11. Safety and privacy
The kernel must not crash in any scenario, so it must be suspicious of the user of the kernel. System calls involve substantial communication of data between user memory and kernel memory. These transfers must be carefully designed to avoid giving user code the ability to crash the kernel. There are many potential hazards to be handled, so we wrote a small library of helpers to check for these hazards. Of course, we needed to handle edge cases such as when a user-passed address is valid during checking but becomes invalid when used, perhaps because another thread modified it.
Another concern is privacy. For example, when the kernel exits, it must not leave traces in the physical memory, which can be used to infer confidential information regarding the kernel or user. So we need to zero out physical memory when appropriate.
12. Designing a hypervisor
As a followup project, we turned our kernel into a hypervisor that could run 15410-style microkernels using paravirtualization, a.k.a the trap and emulate approach. One big challenge is that there are now 3 modes: host kernel, guest kernel, and guest user. Each mode needed different privileges and address spaces. For this project we had to implement our own loader of a kernel. One task, for example, was to parse the guest’s ELF image (text, rodata, data, BSS) and load them into the guest’s address space at the indicated addresses. We also expanded our console to virtual consoles, where threads in each process share a virtual console, and a user could switch between consoles.
Virtual memory mapping is also trickier because there are 2 levels of mapping - the mapping within each guest kernel, and the actual physical mapping known only by the host kernel. We can only afford to store the page directory and tables of the host kernel in kernel memory, and guest tables had to be parsed and modified on the fly.
Most of our work was focused on intercepting and delivering timer interrupts, keyboard interrupts, syscalls, and exceptions to guest kernels. We needed to carefully set up every bit of the environment when we pass control to the guest kernel, else it would crash whilst giving us very little useful debugging information.
Our microkernel took 7400 lines of C and assembly, and 300 hours of pair coding between my and my partner to complete, while the hypervisor took another 1900 lines of C and 120 hours of coding. This was much harder and large-scope than any project we have tackled up to that point, so we both felt a sense of achievement to complete all projects of the course. It was challenging, but I did feel like a different programmer after the course.
For example, I learned to draw grids when making difficult design choices, with the horizontal rows denoting choice A, choice B, etc., and the vertical columns denoting the various criteria/metrics. I also learned that debugging begins with a “story” of what I expect to be happening and continues with measuring the system at increasing levels of detail to determine the point of divergence between expectation and reality. I also learned intricate details of the x86 architecture, got comfortable writing assembly code, and trained myself to spot race conditions with my bare eyes.
But most importantly, this course gave me the confidence to tackle hard systems programming tasks. For a time I even had the illusion that if I could handle debugging in an indeterministic kernel, I could debug anything. I hope to take up more challenging and interesting projects in the future.