Generational References

2.3x faster than reference counting, unoptimized!

Jan 5 2021 — Evan Ovadia

This is an out-of-date version of the article Vale's Memory Safety Strategy: Generational References and Regions

Generational references are a new memory management technique that's easy, deterministic, and very fast.

This technique is the first ingredient in Vale's final hybrid-generational memory design, which is even faster. Our eventual goal is to be as fast as Rust, and perhaps even as fast as C++, while being safer than both. 0

This article explains how generational references work, how they compare to reference counting, and what makes it all so fast. 1

Built on Single Ownership

Recall that in Vale, an object is freed when its owning reference goes out of scope. An object always has exactly one owning reference pointing to it.

We can have as many non-owning references as we want. 2

In other languages, when a programmer frees an object and then accidentally dereferences a non-owning reference to it, it can cause memory unsafety and vulnerabilities. 3

Our goal is to detect this situation and react to it safely. 4

Generational Malloc and the Sacred Integer

Generational references use generational malloc, which is like regular malloc, except at the top of every allocation is a generation number, which tracks how many objects have previously been at this memory location.

One could also think of it as describing "I am the nth inhabitant of this memory location".

Freeing an object will increment its generation number. Nobody else ever modifies it.

Later on, we use this number to see if a particular object is still alive, explained further below.

Generational malloc would normally be an adjustment to mimalloc or jemalloc, but we can simulate it with our own genMalloc and genFree functions:

genFree increments the generation number, and instead of calling free 5 6, remembers the allocation in a free-list. There's a free-list for every size class (16b, 24b, 32b, <=48b, <=64b, <=128b, etc).
genMalloc pulls from a free-list if possible. If it's empty, it calls malloc and initializes the generation number to 1.

You can find our experimental implementation in genHeap.c.

Side Notes

(interesting tangential thoughts)

Notes [–] Notes [+] 0 1 2 3 4 5 6

0

See HGM's afterword for a hypothetical comparison with Rust!

1

Vale has three release modes:

Resilient: Fast and safe.
Assist: for development, detects logic problems.
Unsafe: turns off all safety.

Resilient mode uses hybrid-generational memory.

2

This distinction is similar to C++'s unique_ptr<T> and T*.

3

Rust partially solves this, but forces complexity on the programmer and doesn't solve the ABA problem. We'd like a solution that's simpler, solves the whole problem, with as little run-time overhead as Rust.

4

Such as by halting or stack unwinding.

5

Our experimental implementation doesn't release memory back to the OS until exit, but when a page is empty, the final version will release the page back to the operating system and map its virtual memory to a read-only page containing all 0xFF.

6

When an allocation's generation can't be incremented any more, it's not used again (at least until we can re-map the page).

Generational Reference: More than just a pointer!

Vale's references are generational references. A generational reference has two things:

A pointer to the object.
A "target generation" integer.

To create a reference to an object, we get its allocation's generation number, and include it in the reference.

Dereferencing

To dereference a generational reference, we do a "liveness check" to see whether the allocation's generation number still matches our reference's target generation. 7

This prevents use-after-free problems, and makes Vale completely memory safe.

It's as if the reference is saying:

"Hello! I'm looking for the 11th inhabitant of this house, are they still around?"

and the person who opens the door says:

"No, sorry, I'm the 12th inhabitant of this house, the 11th inhabitant is no more." 8

or instead:

"Yes! That is me. Which of my fields would you like to access?"

Notes [–] Notes [+] 7 8

7

This is similar to the "generational indices" technique from C++ and Rust, but applied to the entire world instead of just a specific vector.

8

This will safely halt the program, unless the user is explicitly checking whether something is alive (such as for a weak reference).

Speed

Generational references are only the first steps towards hybrid-generational memory, but we decided to run some early experiments to see how it compares to existing memory models.

For this experiment, we benchmarked 9 10 three flavors of Vale:

Unsafe, with no memory safety, the equivalent of C++ (minus caveats, see below!)
RC, where we use naive reference counting for all our objects.
GM, which uses generational references.

Mode	Speed (seconds)	Overhead Compared to Unsafe (seconds)	Overhead Compared to Unsafe (%)
Unsafe	43.82 seconds	n/a	n/a
RC	54.90 seconds	+11.08 seconds	+25.29%
GM	48.57 seconds	+4.75 seconds	+10.84%

Notes [–] Notes [+] 9 10

9

We used the BenchmarkRL terrain generator to gather these numbers, with different values for the --region-override flag: unsafe-fast, naive-rc, and resilient-v3 respectively.

10

Here, we benchmarked against other flavors of Vale, to isolate the differences between unsafe, reference-counting, and generational references.

Once we implement regions and inline data, we'll be benchmarking against C++ and Rust, stay tuned!

Generational references have only 10.84% overhead, less than half the cost of reference counting! These are very promising results, and suggest that full hybrid-generational memory could be incredibly fast.

Try it out! In the Vale release, you can find a benchmark folder with scripts to run the benchmarks. You can find the source code for the various approaches here (feel free to swing by the discord server and we can point you to the right files).

Note these caveats! To isolate the difference between generational references and the other approaches:

In all flavors, we only allocate objects on the heap, except for primitives. Future versions will add stack allocations.
We used genHeap.c for all versions, though only GM ever touches the generation number, the other versions ignore it. Future versions will integrate generational malloc into jemalloc or mimalloc directly.

Once we address these limitations, we can get more precise benchmarks against the other approaches.

Why is this so fast?

Generational references are much easier for the CPU to handle than reference-counted references, because:

Generational references have no aliasing/dealiasing overhead, just on dereference.
Generational references cause less cache misses.
Liveness checks' branching is easier to predict than RC decrements' branching.

We explain these two differences more below.

No Aliasing Costs

Reference counting is costly:

Whenever we "alias" (make a new reference to an object), we have to dereference the object to increment its counter.
Whenever we "dealias" (throw away a reference), we have to:

Dereference the object to decrement its counter,
If the counter is zero, deallocate it.

For example:

func launchShip(ships &Map<int, &Spaceship>, id int, armada &List<&Spaceship>) {
  ship = ships.get(id);
  // Increment ship's counter:
  //   ship.__ref_count++;

  armada.add(ship);

  // Decrement ship's counter:
  //   ship.__ref_count--;
  // Deallocate ship if counter is zero:
  //   if (ship.__ref_count == 0) `
  //     ship.__deallocate();
  //   }
}

As you can see, reference counting incurs a cost whenever we alias or dealias. Generational references don't have that cost. The above snippet would have zero overhead if it used generational references.

Instead, generational references incur a cost whenever we dereference an object:

func getShipName(ships &Map<int, &Spaceship>, id int) str {
  ship = ships.get(id);

  // Check if ship is still alive:
  //   assert(shipRef.targetGen == shipRef.obj.actualGen);
  return ship.name;
}

This is cheaper because programs dereference less than they alias and dealias: our sample program had 4.7 million counter adjustments, but only 1.3 million liveness checks. 11 12

More Cache Friendly

Reference counting is not very "cache friendly". Adding and subtracting integers is basically free on modern CPUs compared to the real bottleneck, which is how far those integers are: if it's been recently accessed, it's in the nearby cache, and only takes a few CPU cycles to fetch. Otherwise the CPU will "cache miss" and have to bring it in all the way from RAM, which could take hundreds of cycles. 13

In our reference-counted launchShip example, the ship.__ref_count++ could take a few cycles if ship is already in the cache, or hundreds of cycles if it's not.

Generational references are more cache friendly:

When a generational reference goes away, we don't need to reach into memory (unlike RC, where we have to decrement a counter).
We don't need to increment when aliasing (see previous section); we don't need to reach into memory to increment.

Notes [–] Notes [+] 11 12 13

11

Half of these are aliasings and half are dealiasings. Aliasing happens whenever we access a member (e.g. person.name) or make a new reference (e.g. &person).

12

Many languages are able to skip a lot of the adjustments, using static analysis. For example, Lobster can remove up to 95%. Our experiment doesn't have those optimizations; it compares naive RC to naive generational references.

13

See Understanding CPU caching and performance and Approximate cost to access various caches and main memory.

Better Branch Prediction

For a given if-statement, CPUs will predict whether we'll go down the "then" branch or the "else" branch. This is called branch prediction. It guesses based on various factors.

In Vale, when we do a liveness check, we hint to the CPU that it should assume it will succeed; it doesn't have to guess.

However, in RC, when we check if a counter is zero (to know whether to free the object), we don't know what to tell the CPU to expect. It has to guess. If it's wrong, then it has to back up, throw away any effects it's done in the meantime, and go down the correct branch.

Memory Usage

No approach gets memory safety for free 14, and the same is true here; generational references use some memory.

This technique uses 8 additional bytes for the generation numbers in allocations and in generational references.

However, we don't incur this cost as much as one might think. 15

Vale only uses generational references for non-owning references. The vast majority of references in a program's heap are owning references, which don't need a target generation number, and so don't require the extra 8 bytes.
Inline objects (ones that are embedded into the containing struct's memory) don't need a generation number, they reuse the generation of the struct that contains them.

Static analysis can make objects inline whenever possible.

Static analysis can eliminate generational references, and sometimes even the allocation's generation.

If it sees that a particular reference doesn't escape the lifetime of the object, it makes that reference a regular pointer; no generation number.
For an object, if no references escape the lifetime of the object, it also takes out the allocation's generation number.

Small immutable objects (<=32b) are copied rather than using any sort of overhead.

This is particularly relevant for large arrays of ints, chars, etc.
This can also be specified with the inl keyword.

Objects inside an alternate region (such as arena or pool) have no overhead. Region calling makes this very easy.
References into a locked region only need a pointer, not a generation, see implicit locking for more on this.

In the end, we only use generations when necessary. This is similar to other languages, such as how in C++ we occasionally use shared_ptr and weak_ptr, and in Rust we occasionally use generational_index or Rc. 16

This article is talking about resilient mode (which uses generational references), but extremely memory constrained environments would benefit from developing and testing in assist mode (which is like a super-powered valgrind or asan) and releasing in unsafe mode which has no overhead. 17

Notes [–] Notes [+] 14 15 16 17

14

Memory safety is never free, except for the most trivial of programs. Cyclone, Rust, ATS, Fortran, and every other language incurs some overhead to ensure safety. This comes in the form of branching and cache miss costs, for example in:

Array bounds checks.
Check if something's still alive, e.g. with booleans or generational indices.
Reference count increments/decrements.

Or large memory costs, for example if objects are stored in vectors.

15

Note that static analysis, regions, and inlining are not implemented yet; this list is talking about the final design.

16

We suspect that this approach will use slightly more memory than the average C++ program, and less than the average Rust program.

17

You can read more about assist mode and unsafe mode here!

What's Next?

Stack Allocations

If an object is never moved 18 then we can put it on a stack.

Generational memory will have multiple "generational stacks", one for each size class, just like jemalloc and mimalloc have parallel heaps. 19 20 Because of this, and stack-allocated objects' allocation pattern, it will have cache-friendliness similar to a regular stack.

Additionally, when we identify objects that don't need a generation, they can go on the regular stack, not these generational stacks. 21

Notes [–] Notes [+] 18 19 20 21

18

...or moved only to child calls, and in some cases, only moved to parent calls.

19

We'll have a stack each for 8b, 16b, 24b, 32b, 48b, 64b, etc. Larger objects will occupy multiple entries.

20

Each stack will use a free-list because we need to retire a slot in the stack once its u48 generation hits 0xFFFFFFFFFFFF.

21

For example, an iterator might be owned by one local and never moved, and only hand out references that are guaranteed to not outlive the object, so we can guarantee it doesn't need a generation.

Inline Objects

We can support objects containing other objects inline. 22 The only complication is that inline objects are not right after their allocation's generation, as previously assumed. So, we'll:

Add a u16 to the reference, an offset from the object's start to the allocation's generation.
Change how we get a generation: instead of just dereferencing the object pointer, subtract the u16 offset from it first.

Notes [–] Notes [+] 22

22

For example, in C, a struct can live inside another struct's memory, unless it's a pointer.

Hybrid-Generational Memory

The generational reference is only the first step towards hybrid-generational memory, and it already beats reference counting.

Hybrid-generational memory adds two layers of optimization:

Static analysis, to skip liveness checks.
Scope tethering, to keep an object alive longer.

When hybrid-generational memory is fully realized, we expect it could be as fast as Rust, and almost as fast as C++. 23

We're excited about this, because it gives us raw speed with zero unsafety, and keeps the language easy to learn and use.

See Hybrid-Generational Memory to learn more, and feel free to swing by the discord server with any questions or ideas!

Notes [–] Notes [+] 23

23

See Hybrid-Generational Memory for some comparison with Rust!