Implementing a garbage collector for Rust

This is only an article draft

Open Flash is a project to create an open-source media player for SWF files. Its goal is to keep Flash animations accessible for archival purposes. It aims to achieve it by learning from related projects and building an ecosystem of reusable components.

I am currently working on my Actionscript Virtual Machine. Actionscript is a garbage-collected language similar to Javascript. This means that users can create objects without worrying about their memory-management: freeing no-longer needed objects is the role of the VM. To achieve it, the VM uses a technique called "garbage-collection". My VM is written in Rust, it means that I need to implement a garbage-collector in Rust.

Mark & Sweep Garbage Collection

I see garbage collection as an extension of memory management offered by smart pointers.

The essential requirement of memory management is to provide ways to dynamically allocate portions of memory to programs at their request, and free it for reuse when no longer needed.

Wikipedia: Memory Management

Memory management is a hard problem. Rust's borrow-checker, one of its main innovation selling point, was actually created to deal with memory management. The main challenge is to free memory only once it is no longer needed. It sounds easy but it is easy to make a mistake when doing it manually.

By default, Rust's compiler prevents memory errors. This is achieved by adding constraints to your program (such as lifetimes) so Rust is able to reason about your code and prove that it is safe with regard to memory errors. The borrow-checker and abstractions in the standard library let you create most programs, but it is sometimes not enough. If you want manual control over memory, you can use Rust's unsafe blocks but you are on your own then: the compiler can no longer catch memory errors. Garbage collection is a memory-safe abstraction that lets you represent arbitrary data. It is a form of automatic memory management. The downside is that it has a higher runtime cost.

The different abstractions to handle memory are a bit like sets of numbers. You can start with simple ones and build more general ones: integers, rationals, real numbers, complex numbers, quaternions, and so on. The more general you are, the more complex (no pun intended) your model is. Using a more general model comes with a cost (that's why people prefer to avoid garbage collection in performance-critical code).

Static values are the simplest one. They just exist for the whole duration of your program. These are usually constants or global variables. Obviously their main downside is that you can't create them at runtime.

Stack-allocated values are the next simplest values. These are the values in the local variables of your functions. They are allocated when the function is called and freed when the function returns. They are also the most limited: their size must be known at compilation (they must have the Sized trait). Values that live entirely on the stack are integers, pointers and structs or enums of stack-allocated values. This restriction is why you cannot have local variables of type str: the compiler cannot predict the size of the string slice. Instead, the type of your local variable is a reference &str. The reference is based on a pointer of known of known size (8 bytes on a 64-bits system).

You need heap memory to implement dynamically sized collections such as vectors. The values on the stack are allocated for the lifetime of a function call. The heap is more general: you can allocate and free memory on the heap whenever you want. Heap-allocated values are accessed through pointers (memory addresses). For example you can allocate memory in one function, return the corresponding pointer and free its heap-memory in another function. In the case of vectors, Vec is small fixed-size structure with a pointer to heap-allocated data. When you push values to the vector, it automatically ensures that its heap memory has enough space and stores the value there.

This situation is very common when dealing with dynamic data. The dynamic part is allocated on the heap and a small fixed-size struct represents it on the stack. When this struct acts as a simple pointer but automatically controls when its heap-memory should be freed, we call it a "smart pointer".

Rust's Box smart pointers have ownership of their value. They are implemented as raw pointers that free their memory when they are dropped. A Box pointer is the only one pointing to its heap-data so both the pointer and the data live as long. It lets you represent recursive tree structure.

Rc provides shared ownership of a heap value. There can be many Rc pointers for the same value. The value holds a counter of active references ("rc" stands for reference-counted) and is dropped once this number reaches zero. It lets you represent directed acyclic graphs

Weak pointers are an extension to reference-counted values. A Weak pointer can point to a shared value (with other Weak and Rc) but is not tracked with the counter keeping the value alive. The memory is reclaimed when the last Rc is dropped even if there are still some Weak pointing to it. Memory safety is achieved by requiring you to manually upgrade Weak pointer to an Rc before accessing its value. You are also required to deal with possible failure if the value was freed. Rc with Weak pointers let you represent directed acyclic graphs with weak back-references.

With all these pointers, the only thing you can't represent without leaks or invalid pointers are arbitrary object graphs: graphs with cycles. The role of Gc is to support these kinds of object graphs. The memory management becomes more complex: objects are freed once they are no longer reachable. An object is reachable if you can move from pointer to pointer by starting at a value on the stack.