Zero-Cost References with Regions in Vale

Watch regions eliminate reference-counting overhead!

July 29th, 2020 — Evan Ovadia

Note: This article was published before Vale switched to generational references from RC. Now, read-only regions eliminate generation checks, rather than RC overhead.

Vale is rapidly approaching v0.1, 0 and now that we've proven the foundations solid, we can finally share our plans for making Vale extremely fast.

The fastest language currently is C++. 1 It is an incredibly high bar, but we hope to go even faster (or at least come close in the attempt!)

Vale's Fast Mode can already perform as fast as C++ by turning off memory safety, 2 but this article is talking about how Vale's Assist Mode and Resilient Mode might be able to approach C++'s speed with zero unsafety.

C++ is a low-level language, meaning that theoretically, it cannot be beat; given unlimited time, you can optimize C++ code enough to beat anything.

However, in the real world, we don't have unlimited time; we might only have a few days, an evening, or a couple hours to implement what we need. When development time is a factor, we need a language that can optimize as much as possible with as little effort as possible. Even if we did have unlimited time, we want to spend it adding cool features, not optimizing slow code!

Because we don't have unlimited time, development speed and ergonomics matter. Five hours fixing undefined behavior bugs is five hours not spent optimizing. But also, if a language forces us to think hard about and prematurely optimize everything, we spend less time optimizing the hot path. 3

Vale is aiming at the optimal balance: make it easy to make fast code by default, and give the developer powerful tools to optimize the hot path even further.

The challenge here is that memory safety has a run-time cost, no matter what language you're in. 4

To overcome that, Vale can use its unique mix of single ownership and regions to: 5

Reduce safety overhead as close to zero as possible.
Make it ridiculously easy to use bump allocation and pool allocation, such that we use it in more places.

Note that these are works in progress; we'll be implementing these features over the next year or two. Vale is open to contributors, so if you'd like to help bring these ideas into the world, come join us!

If we can achieve this performance, then Vale's blend of speed, safety, and ease could make it the obvious choice for performance-critical software.

To set the stage, let's talk about the safety overhead first.

Side Notes

(interesting tangential thoughts)

Notes [–] Notes [+] 0 1 2 3 4 5

Check out the Roadmap for progress and plans!

See Benchmarks Game. Fortran, C, and Rust are also close to C++'s performance.

One usually uses Assist Mode's very conservative checks to gain confidence that Fast Mode won't crash. See The Next Steps for Single Ownership and RAII for more on how this works.

The "hot path" is the (relatively small) area of code that programs spend most of their time in. For example, a rendering loop in a game engine.

See The Next Steps for Single Ownership and RAII for how every memory safety strategy has run-time costs.

Vale has some other tricks up its sleeve too:

Accelerated Weak Tables, to reduce weak references' cache misses.
"Fast Resilient Mode" which won't use ref-counting or borrow-checking for its memory safety, at the cost of slightly more memory usage.

We'll be posting about these by early next year, but feel free to come by the Vale discord and ask about it before then!

Reference Counting

Vale's Fast Mode doesn't use reference counting, but its Assist Mode and Resilient Mode use it to implement non-owning references (constraint references and weak references). This is heavily optimized (more below), but still incurs some non-zero overhead at run-time.

Whenever we make a new reference to an object, we must increment that object's reference count, 6 and when that reference goes away, we must decrement it again. We can't deallocate an object until its reference count is zero.

The first optimization that will help with this is called compile-time reference counting. Invented by Wouter van Oortmerssen for Lobster, it uses Rust-inspired lifetime analysis to eliminate 95% of increments and decrements at compile time, leaving only 5% to happen at run-time. 7

Now let's look at the overhead of the remaining 5%. Reference counting has what is commonly 8 known as the three vexing fears: cycles, atomicity, mispredictions, and cache-misses. 9 Vale solves all of them.

Cycles

The first weakness of RC is that it can form cycles, causing memory leaks. In practice, Vale doesn't have this problem because every object has one owning reference, which controls when the object is deallocated. 10

For the difference between owning, constraint, and weak references, see The Next Steps for Single Ownership and RAII.

Atomicity

When two threads increment or decrement the same object's reference count, they can interfere with each other. There are various ways to avoid this:

In Python, the incrementing/decrementing is non-atomic, but that means only one thread can run at a given time.
In Swift, we can have multiple threads, but it means every reference count is atomic, which is very slow.

Christian Aichinger tried making Python's ref-counting atomic, and it resulted in a 23% slowdown. This is probably the main reason Swift is slower than C. 11

In Vale, an object can only be modified by one thread at a time, 12 so a program can have threads and still use non-atomic ref-counting.

Branch Mispredictions

RC can also suffer from branch misprediction, where the CPU can't predict whether we'll deallocate the object or not. In Vale, there's no branching at all; letting go of constraint or weak references will never deallocate the object, and letting go of an owning reference will always deallocate it.

Cache Misses

A CPU can non-atomically increment or decrement an integer very quickly; instructions are basically free on modern CPUs. The real bottleneck is in how far the data is: if it's been recently accessed, it's in the nearby cache (the data is "hot"). Otherwise the CPU will "cache miss" and have to bring it in all the way from RAM (the data is "cold").

So, even if we make our ref-counting non-atomic and optimize most of it away, any remaining ref-counts on cold data will still incur cache-miss costs.

Vale can avoid ref-counting on cold data by using read-only regions.

Read-Only Regions

In Vale, we can split our memory into various regions. We can lock a region, and all references into it are completely free; we don't have to increment or decrement its objects' ref counts.

We can do this with implicit locking and explicit locking.

Implicit Locking

Programs often have "pure" functions, where a function...

Reads the outside world through its parameters,
Does a bit of calculation (perhaps modifying some of its own locals along the way),
Returns a result,

...all without modifying the outside world.

In Vale, we can annotate a function with the pure keyword to make the compiler enforce this. This is a common design pattern, and leads to much more maintainable and testable code.

If we add region markers to our pure function, Vale will implicitly lock all existing memory, thus making references to any existing memory completely free; we don't have to increment or decrement anything because:

None of the objects they point to can change because they're immutable.
All these references are temporary, and will go away before we unlock again.

Below, we use region markers to tell the compiler which references point to the outside world.

Let's see it in action! Let's say we have a turn-based game, which runs in Unity. Whenever the player unit acts, each of the enemy units uses the below code to act too.

Each enemy unit figures out what it wants to do most.

To do this, each unit looks at all the things it can do (its abilities, such as Idle, Wander, Chase, Attack), and asks each ability, "what do you want?".

To generate a desire, an ability will look at its unit and the world around it.

An IDesire describes what the unit could do, and how much it wants to do that.

When we have all the IDesires, we sort them to figure out what the strongest one is, and enact it.

By adding the r' to strongestDesire's this &Unit, we're telling the compiler that this will come from a region we call r'.

There's no specific region whose name is r' (rather, r' is how we refer to whatever region contains this, so it's a generic parameter, hence the <'r ro>). The ro specifies that it's a read-only region, making all references into r' free.

This function doesn't change anything about the unit or the world, it just reads them and does calculations.

For example, ChaseAbility's getDesire function will look for the nearest unit, and return a very strong (70!) desire to chase it.

func gameLoop(world &World) {
  foreach unit in world.enemyUnits {
    // Implicit lock happens here!
    desire =
      unit.strongestDesire(world);
    // Now the world is mutable!
    unit.enactDesire(desire);
  }
}

pure func strongestDesire<i' rw, r' imm>( 13
  this &r'Unit)
IDesire<r', i'> 14
'i { 15
  desires =
    this.abilities*.getDesire(); 16
  desires.sort(
    { _.strength() > _.strength() }); 17
  return desires[0];
}

pure func getDesire<i' rw, r' imm>(
  self &r'ChaseAbility)
IDesire<i', r'>
'i {
  unit = self.unit;
  world = unit.world;
  loc = unit.location;
  nearbyUnits =
    world.findNearbyUnits(loc);
  closest = nearbyUnits[0];
  closestLoc = closest.location;
  path =
    world.findPath(loc, closestLoc);
  return ChaseDesire(70, closest, path);
}

struct ChaseDesire<i' rw, r' imm> i' {
  impl IDesire;

  strength Int;
  victim &r'Unit;
  path List<Location>;

  func strength(&self) 18
  Int {
    return self.strength;
  }
}

Notes [–] Notes [+] 6 7 8 9 10 11 12 13 14 15 16 17 18

This can be a constraint ref count or a weak ref count, depending on the reference.

Commonly in some very small circles, that is.

...we had plenty of off-by-one errors when implementing reference counting.

One can make an ownership cycle in theory, but it never actually happens in practice. As any C++ programmer can attest, it's almost impossible to accidentally get a unique_ptr cycle.

See Benchmarks Game.

Similar to Rust, each thread's memory is isolated from the rest, and we can send messages between them, or use a mutex to take turns accessing data safely.

The 'r ro and i' are regions. ' means region, and ro means read-only.

IDesire<'r, 'i> is defined below. It uses region arguments so it can point into multiple regions at once.

This i' here specifies the default region for our allocations and calls.

This is the "map" operator, it calls a method on all elements of a collection.

It's equivalent to:
unit.capabilities.map(
(c){ c.generateImpulse() } )

or in other languages:
unit.capabilites.map(
c => c.generateImpulse() )

The _ means "the argument".

This is equivalent to:
(a,b){a.strength < b.strength}

or in other languages:
(a,b)=>a.strength < b.strength

impl is like @Override in java.

getDesire is a heavy, read-only operation. It doesn't change anything about the unit or the world, but it does breadth-first searches, A* pathfinding, and a bunch of other algorithms, which make (and then let go of) a lot of references into the World.

Without the region markers, every time we make (or let go of) a reference into the unit or anything else in the world, we increment and decrement a ref-count. Worse, the World would be cold, because Unity has probably rendered a few hundred frames since the last turn, and has long since wiped our World from the cache.

With the region markers, the compiler knows that only the things inside the i' region can change, and nothing in the r' region will change, making references into r' completely free. All of our references to this cold data, which would have incurred RC costs, are now free.

There is a caveat: When we return a reference from the implicitly locked call, it increments the ref-count in the object it's pointing to. In the example, ChaseDesire.victim will increment the Unit it's pointing at, as it's returned. 19 20 One can often use explicit locking to avoid this kind of overhead.

Explicit Locking

Implicit locking locked all existing memory, and made a small new region called i' which we could modify. There's a more precise way to manage regions: mutexes! 21

The Vale compiler itself has a great example of when we'd want explicit locking. Six transformation stages translate the source code into intermediate ASTs 22 and eventually into an executable binary. 23 Each stage takes in the previous AST, read-only, and constructs the next AST.

One of those is the "Templar" stage, which reads the inAst and builds the outAst. We can put the inAst in a Mutex, and the outAst in another Mutex. The Templar gets read-only access to the inAstMutex, while it uses it's read-write access to the outAstMutex to build it up.

In the below code, we have an example.

Here, the templar function takes in the inAstMutex.

The inAstMutex starts closed, so we call openro to open it for read-only access.

We then create a new Mutex containing an empty OutAst. We immediately open it in read-write mode.

We give both the outAst and a function from the inAst to translateFunction, so it can make a translated function and add it to outAst.

At the end of templar, the locks are dropped, automatically closing the mutexes, and we return the now-closed outAstMutex.

With our Mutexes and region markers, the compiler can give us free, zero-cost access to everything in the inAst.

func templar(
    inAstMutex &Mutex<InAst>) {
  inAstLock = inAstMutex.openro();
  inAst = inAstLock.contents;

  outAstMutex = Mutex({ OutAst() }); 
  outAstLock =
    outAstMutex.openrw(); 24
  outAst = outAstLock.contents;

  translateFunction(
      inAst.functions[0], &outAst);

  ...

  return outAstMutex;
}

func translateFunction<'a, 't>(
  func &r'InAstFunction,
  outAst &t'OutAst)
OutASTFunction {
  // Read func, add things to outAst.
  ...;
}

We still increment and decrement the ref-counts of objects inside i', but we just made those objects, so they'll likely be hot in the cache.

We can take this even further: we can combine explicit locking and implicit locking, and even do implicit locks from within implicit locks. By layering these locking techniques, we can compound our benefits and speed up our program even more!

Notes [–] Notes [+] 19 20 21 22 23 24

One could say that we're only doing the reference counting we need to, for the result of the function.

There are some potential SIMD opportunities to parallelize these increments.

They aint just for multi-threading anymore!

Stands for Abstract Syntax Tree, which is a simplified version of code, after we've parsed it from the original text.

If you're curious, the six stages are named Scout, Seer, Astronomer, Templar, Hammer, and Midas.

Mutex takes a function which it will call to get its initial value.

Region Memory Strategy

In our game example above, references into r' were completely free. And references into i' were probably hot in the cache, making its reference counting very fast.

How much more can we do? Much more. This is where things get a bit crazy.

Vale's pool and arena allocation can eliminate the ref-counting overhead in i' too, and lets eliminate its malloc and free overhead as well, while we're at it.

The default memory management strategy for a region is to use the heap, which uses malloc and free under the hood.

We can make it so a certain region uses pool allocation, which is much faster. Pool allocation uses a large "slab" of memory, and keeps allocating from the next part of it. 25 It also caches all "freed" structs for future allocations of the same type. When it runs out of memory in the slab, it allocates another one. 26

Functions can also use arena allocation, which doesn't reuse memory, but just keeps allocating it. This can push performance even faster, though one should be careful when using this, as it could cause out-of-memory errors.

Pool allocation's benefits:

It's extremely fast, because instead of an expensive call to malloc, allocation is simply incrementing the "bump pointer" in the underlying slab, or using a cached one. 27
It's very cache-friendly, because all of our allocated objects are right next to each other.
In release mode, we can completely optimize out all constraint reference counting to references inside the pool region, with no loss to safety. 28
We pay no cost to deallocate, because we deallocate the slabs all at once at the end of the function!

Pool allocation's costs:

Since we cache these structs, our memory usage could be higher. For example, if we make 120 Spaceships and let go of 20 of them, those 20 will still be using up memory. That's why pools are good for the span of certain functions, and not the entire program.
Moving objects between regions (e.g. when returning from an implicit lock function that uses a pool region) sometimes requires copying those objects. 29

Used well, a pool allocator can drastically speed up a region.

Notes [–] Notes [+] 25 26 27 28 29

This only applies to objects that would have been on the heap; any objects we put on the stack will still use the stack.

Type-specific pools are 100% safe with no ref-counting overhead, because use-after-free doesn't actually use a freed object, it uses one that's still alive in memory, and of the correct structure.

Internally, this uses a hash-map of free lists by type ID, with some interesting reuse of memory inside the slab.

This is safe because the memory is not reclaimed by anyone else, we're just accessing old data, which isn't a memory safety problem, just a logic problem (and one that would be caught in Assist Mode).

One could say that we're only paying the RC cost for the things we actually return from the function.

For example, we could use pool allocation for this basic breadth-first-search algorithm, that checks for units at every nearby location.

We use the keyword pool after the region declaration i'.

We just made ref-counting free for our findNearbyUnits function, and completely avoided malloc and free overhead. 30

References into the r' region are free because it's read-only.
References into the i' region are free because it uses pool allocation.

pure func findNearbyUnits<r' ro, i' = pool>
  (world &r'World, origin Location)
i'List<&r'Unit>
i' {
  result = List<&r'Unit>(); 31
  exploredSet = HashSet<Location>();
  unexploredQueue =
    Queue<Location>(origin); 32
  unexploredSet =
    HashSet<Location>(origin);
  while (unexploredQueue.nonEmpty()) {
    // Get next loc, mark it explored.
    loc = unexploredQueue.pop();
    unexploredSet.remove(loc);
    exploredSet.add(loc);

    // If there's a unit here, add it.
    if [u] = world.unitsByLoc(loc) {
      result.add(u);
    }

    // Add nearby locs not seen yet.
    newNearbyLocs =
      world.getAdjacentLocations(loc)
        .filter(
          { not exploredSet.has(_) })
        .filter(
          { not unexploredSet.has(_) });
    unexploredQueue.addAll(
      &newNearbyLocs);
    unexploredSet.addAll(
      &newNearbyLocs);
  }
  return result;
}

Notes [–] Notes [+] 30 31 32

The only memory overhead we paid is when we copied findNearbyUnits's i'List<&r'Unit> result from the pool region into the caller's region.

In Vale, List is backed by an array. If one wants a linked list, they can use LinkedList.

Circular queue, backed by an array.

Next Steps

Vale's regions give the programmer incredible flexibility on where and how to optimize their code. Because Vale makes it so much easier to do this kind of optimization, Vale programs could be have performance rivaling even C++.

Over the next year or two, we'll be trying these out, as well as some other ideas on cutting down the reference-counting overhead. Regions and single ownership have never been combined in this way, so we're discovering new potential every day.

If you want to see this happen sooner, or just want to contribute to something cool, we invite you to come join us! 33

We'd love to hear your thoughts on regions and zero-cost references, so leave a comment!

Stay tuned for the next articles, where we talk about Vale's optimizations, pentagonal tiling, Vale's killer app, and more. If you want to learn more before then, come by the r/Vale subreddit or the Vale discord server!

Notes [–] Notes [+] 33

All contributions are welcome! Soon, we're going to:

Finish designing the region borrow checker!
Implement the bump allocator and pooling!
Write a standard library! (sets, hash maps, lists, etc)
Make syntax highlighters! (VSCode, Sublime, Vim, Emacs, etc)
Enable support gdb/lldb for debugging!
Add better error reporting!
Replace the temporary combinator-based parser with a real one!
Add a "show all constraint refs" option in debug mode to our LLVM codegen stage!

If any of this interests you, come join us!

Afterword: Borrow Checking in Vale and Rust

Vale and Rust have similar-sounding goals: safety and speed. We're often asked how the approaches are different, so I like including these afterwords to really dive into the subject. Some preliminary notes:

This is only my opinion, an honest appraisal of Rust's strengths and weaknesses, from experience. Also note that, while we have used Vale (and this style of programming pre-Vale), its performance characteristics are still unproven, so take all of this with a grain of salt.
Regardless of the common focus on safety and speed, Vale is designed for much higher-level use cases such as apps, games, and servers, rather than Rust's use cases which are much closer to the metal. Comparing Rust and Vale is a fun exercise, but not as useful as, say, comparing Vale to C++.

Rust uses the borrow checker to make sure that if there's one reference active to an object, there cannot be another mutable reference active to it, to enforce safety.

Rust's borrow checker has some major benefits:

It's safe!
It forces us to use much more cache-friendly patterns, making our programs very fast. For example, in Rust we often put our objects into Vecs which serve as pools.

However, like any approach, it also has drawbacks:

It's difficult; since the borrow checker does not work for all references, 34 we have to spend a lot of time figuring out how to work around it.

Luckily, this can be learned. After a while, you can understand when and how to work around the borrow checker, and when to fall back on unsafe code.

It makes certain patterns impossible, and one often finds their architecture forced in a certain direction to appease the borrow checker. 35

One can use Rc<RefCell<T>> for a lot of these situations, but it's regarded as a code smell by a lot of the Rust community. 36 If we're to consider a Rust program that has a blend of Rc<RefCell<T>> and its associated run-time overhead, we should also consider other fast low-run-time-overhead languages.
Cell is also useful in certain situations!

Notes [–] Notes [+] 34 35 36

If you translate a program from C++ or C# to Rust for example, you'll find that the borrow checker rejects most of your references and pointers, and you'll have to use indices or Rc<RefCell<T>>.

Some examples:

A doubly linked list, or any other architecture involving back-references.
Attaching an observer to a button, which modifies outside state without permanently freezing it for everyone else, making it useless.
The "component" pattern, where subcomponents of a struct have references to each other.

Personally, I think that it shouldn't be regarded as a code smell. I think using Rc<RefCell<T>> for objects at a higher level and then use borrow checking for the object's members (and sub-objects) is a really solid pattern.

Also check out Cone, a language which is exploring compiler-assisted DIY memory management that allows you to optimize performance and safety across arena, pool, single-owner, RC, tracing GC, and borrowed ref strategies.

Rust is an amazing tradeoff for low-level systems programming (drivers, embedded, operating systems, etc). It still works well in higher-level use cases (apps, certain kinds of games, anything with a lot of interconnected state), but not as well as for the low-level cases.

Vale uses region borrow checking, which operates on groups of objects, rather than each individual object.

Region borrow checking has some major benefits:

Since we aren't applying the borrow checker on a per-object basis, we can alias freely, and are not forced into certain patterns or architectures.
References into immutable regions are free!
We can designate entire regions to use pool allocation, with one keyword. This means that:

Every struct can have its own pool, where we reuse old instances.
Every struct can use the same bump allocator.
An entire region can avoid reference-counting overhead completely.

However, Vale's approach could have some drawbacks:

When not in a region, it uses normal malloc/free and ref-counting under the hood (though an estimated 95% of the ref-counting can be optimized away). This could add up.
For programs that spend 99% of their time in 5% of the code, this is great; we can throw that 5% in a bump-allocating pool region which does most of its outside accesses with implicit locking, and have a super fast program. But if a program doesn't have a particular hot spot, then Vale's approach could be more tedious.

One interesting thing to note is that both Vale and Rust influence the programmer in a certain direction:

Take the "outside world" as immutable.
Do a bunch of calculations.
Calculate the desired changes to the world.
Apply those changes.

This pattern leads to more testable, maintainable code, so this is a very good thing.