Generational references are a new memory management technique that's easy, deterministic, and very fast.
This technique is the first ingredient in Vale's final hybrid-generational memory design, which is even faster. Our eventual goal is to be as fast as Rust, and perhaps even as fast as C++, while being safer than both. 0
This article explains how generational references work, how they compare to reference counting, and what makes it all so fast. 1
Recall that in Vale, an object is freed when its owning reference goes out of scope. An object always has exactly one owning reference pointing to it.
We can have as many non-owning references as we want. 2
In other languages, when a programmer frees an object and then accidentally dereferences a non-owning reference to it, it can cause memory unsafety and vulnerabilities. 3
Our goal is to detect this situation and react to it safely. 4
Generational references use generational malloc, which is like regular malloc, except at the top of every allocation is a generation number, which tracks how many objects have previously been at this memory location.
One could also think of it as describing "I am the nth inhabitant of this memory location".
Freeing an object will increment its generation number. Nobody else ever modifies it.
Later on, we use this number to see if a particular object is still alive, explained further below.
Generational malloc would normally be an adjustment to mimalloc or jemalloc, but we can simulate it with our own genMalloc and genFree functions:
You can find our experimental implementation in genHeap.c.
See HGM's afterword for a hypothetical comparison with Rust!
Vale has three release modes:
Resilient mode uses hybrid-generational memory.
This distinction is similar to C++'s unique_ptr<T> and T*.
Rust partially solves this, but forces complexity on the programmer and doesn't solve the ABA problem. We'd like a solution that's simpler, solves the whole problem, with as little run-time overhead as Rust.
Such as by halting or stack unwinding.
Our experimental implementation doesn't release memory back to the OS until exit, but when a page is empty, the final version will release the page back to the operating system and map its virtual memory to a read-only page containing all 0xFF.
When an allocation's generation can't be incremented any more, it's not used again (at least until we can re-map the page).
Vale's references are generational references. A generational reference has two things:
To create a reference to an object, we get its allocation's generation number, and include it in the reference.
To dereference a generational reference, we do a "liveness check" to see whether the allocation's generation number still matches our reference's target generation. 7
This prevents use-after-free problems, and makes Vale completely memory safe.
It's as if the reference is saying:
and the person who opens the door says:
This is similar to the "generational indices" technique from C++ and Rust, but applied to the entire world instead of just a specific vector.
This will safely halt the program, unless the user is explicitly checking whether something is alive (such as for a weak reference).
Generational references are only the first steps towards hybrid-generational memory, but we decided to run some early experiments to see how it compares to existing memory models.
|Mode||Speed (seconds)||Overhead Compared to Unsafe (seconds)||Overhead Compared to Unsafe (%)|
|RC||54.90 seconds||+11.08 seconds||+25.29%|
|GM||48.57 seconds||+4.75 seconds||+10.84%|
We used the BenchmarkRL terrain generator to gather these numbers, with different values for the --region-override flag: unsafe-fast, naive-rc, and resilient-v3 respectively.
Here, we benchmarked against other flavors of Vale, to isolate the differences between unsafe, reference-counting, and generational references.
Once we implement full hybrid-generational memory, we'll be benchmarking against C++ and Rust, stay tuned!
Generational references have only 10.84% overhead, less than half the cost of reference counting! These are very promising results, and suggest that full hybrid-generational memory could be incredibly fast.
Try it out! In the Vale release, you can find a benchmark folder with scripts to run the benchmarks. You can find the source code for the various approaches here (feel free to swing by the discord server and we can point you to the right files).
Note these caveats! To isolate the difference between generational references and the other approaches:
Once we address these limitations, we can get more precise benchmarks against the other approaches.
Generational references are much easier for the CPU to handle than reference-counted references, because:
We explain these two differences more below.
Reference counting is costly:
As you can see, reference counting incurs a cost whenever we alias or dealias. Generational references don't have that cost. The above snippet would have zero overhead if it used generational references.
Instead, generational references incur a cost whenever we dereference an object:
Reference counting is not very "cache friendly". Adding and subtracting integers is basically free on modern CPUs compared to the real bottleneck, which is how far those integers are: if it's been recently accessed, it's in the nearby cache, and only takes a few CPU cycles to fetch. Otherwise the CPU will "cache miss" and have to bring it in all the way from RAM, which could take hundreds of cycles. 13
In our reference-counted launchShip example, the ship.__ref_count++ could take a few cycles if ship is already in the cache, or hundreds of cycles if it's not.
Generational references are more cache friendly:
Half of these are aliasings and half are dealiasings. Aliasing happens whenever we access a member (e.g. person.name) or make a new reference (e.g. &person).
Many languages are able to skip a lot of the adjustments, using static analysis. For example, Lobster can remove up to 95%. Our experiment doesn't have those optimizations; it compares naive RC to naive generational references.
For a given if-statement, CPUs will predict whether we'll go down the "then" branch or the "else" branch. This is called branch prediction. It guesses based on various factors.
In Vale, when we do a liveness check, we hint to the CPU that it should assume it will succeed; it doesn't have to guess.
However, in RC, when we check if a counter is zero (to know whether to free the object), we don't know what to tell the CPU to expect. It has to guess. If it's wrong, then it has to back up, throw away any effects it's done in the meantime, and go down the correct branch.
No approach gets memory safety for free 14, and the same is true here; generational references use some memory.
This technique uses 8 additional bytes for the generation numbers in allocations and in generational references.
However, we don't incur this cost as much as one might think. 15
In the end, we only use generations when necessary. This is similar to other languages, such as how in C++ we occasionally use shared_ptr and weak_ptr, and in Rust we occasionally use generational_index or Rc. 16
This article is talking about resilient mode (which uses generational references), but extremely memory constrained environments would benefit from developing and testing in assist mode (which is like a super-powered valgrind or asan) and releasing in unsafe mode which has no overhead. 17
Memory safety is never free, except for the most trivial of programs. Cyclone, Rust, ATS, Fortran, and every other language incurs some overhead to ensure safety. This comes in the form of branching and cache miss costs, for example in:
Or large memory costs, for example if objects are stored in vectors.
Note that static analysis, regions, and inlining are not implemented yet; this list is talking about the final design.
We suspect that this approach will use slightly more memory than the average C++ program, and less than the average Rust program.
You can read more about assist mode and unsafe mode here!
If an object is never moved 18 then we can put it on a stack.
Generational memory will have multiple "generational stacks", one for each size class, just like jemalloc and mimalloc have parallel heaps. 19 20 Because of this, and stack-allocated objects' allocation pattern, it will have cache-friendliness similar to a regular stack.
Additionally, when we identify objects that don't need a generation, they can go on the regular stack, not these generational stacks. 21
...or moved only to child calls, and in some cases, only moved to parent calls.
We'll have a stack each for 8b, 16b, 24b, 32b, 48b, 64b, etc. Larger objects will occupy multiple entries.
Each stack will use a free-list because we need to retire a slot in the stack once its u48 generation hits 0xFFFFFFFFFFFF.
For example, an iterator might be owned by one local and never moved, and only hand out references that are guaranteed to not outlive the object, so we can guarantee it doesn't need a generation.
We can support objects containing other objects inline. 22 The only complication is that inline objects are not right after their allocation's generation, as previously assumed. So, we'll:
For example, in C, a struct can live inside another struct's memory, unless it's a pointer.
The generational reference is only the first step towards hybrid-generational memory, and it already beats reference counting.
Hybrid-generational memory adds two layers of optimization:
When hybrid-generational memory is fully realized, we expect it could be as fast as Rust, and almost as fast as C++. 23
We're excited about this, because it gives us raw speed with zero unsafety, and keeps the language easy to learn and use.