Now that we're wrapping up the Vale 0.2 release, we're free to continue one of our early goals: making Vale into the safest native language.

Most languages must choose between being safe and being flexible. Vale's generational references give it a unique opportunity to bridge that gap, by solving the leaky unsafe problem found in other languages' FFI and unsafe blocks.

This page describes our final design, and what we hope to accomplish. It involves some borderline-insane acrobatics with bitwise xor and rotate, two simultaneous stacks, and inline assembly. Buckle up!

If you're impressed with our track record and believe in the direction we're heading, please consider sponsoring us on GitHub! We can't do this without you, and we appreciate all the support you've shown.

The Challenge: Leaky Unsafe

Normally, when mixing a safe language with an unsafe language, any bugs in the unsafe language can cause problems in the safe language. For example:

  • When Python code sends a Python object into C, if the C code doesn't correctly call Py_INCREF, it will corrupt Python's memory and cause some mysterious behavior later on in the Python code.
  • When Rust hands a reference into C, the C code can type-cast it at will to write arbitrary memory to the Rust object, causing confounding bugs later on in the safe Rust code.
  • When JavaScript hands the wrong kind of object to a TypeScript function, it causes bugs down the line deep in the TypeScript code, even though TypeScript has static typing.

This is called "leaky safety", and its bugs are very difficult to track down.

This can also happen when a language has unsafe blocks. If some unsafe code corrupts some memory, it can cause undefined behavior in safe code. For example, see this Rust snippet where an unsafe block corrupts some memory that's later used by the safe code.

In all these cases, we know that the unsafe language was involved somewhere in the chain of events, but since the bugs actually happen later on, in supposedly safe code, there's no easy way to identify which unsafe code was the original culprit.

The Goal

Our goal is to prevent any accidental memory unsafety in the unsafe language from corrupting any memory in the safe language.

To be more specific, our goal is just to prevent accidental problems: for now, we're protecting against bugs, not hostile code. Malicious memory unsafety is a much larger topic that involves a lot more context about Vale's ultimate plans, and we'll address it in a bit.

Region Boundary Hardening

Vale protects against these bugs with region boundary hardening. When it's applied to FFI, we call it Separated FFI.

Region boundary hardening is when we:

  • Separate the safe memory from the unsafe memory (such as the memory managed by C). This includes:
    • Using a different stack for the unsafe code.
    • Not allowing safe objects to contain unsafe objects.
    • Not allowing unsafe objects to contain safe objects.
  • Allowing references between the two:
    • A safe object can contain a reference to an unsafe object.
    • An unsafe object can contain a reference to a safe object if it is scrambled.
  • Data can be passed between the two by copying, in other words, message passing.

Let's explore each of these points!

References Between the Two

One of the reasons it's risky to call into an unsafe language is that its code can use a pointer to access a Vale object in ways that corrupt that object.

exported struct Engine { fuel int; }
exported struct Ship { engine ^Engine; }
exported func main() {
  s = Ship(^Engine(42));
  halfFuel(&s);
}
extern func halfFuel(s &Ship) int;

extern int myproject_halfFuel(myproject_Ship* ship) {
  // Whoops, accidentally overwrote a pointer!
  *(int*)ship->engine = ship->engine->fuel / 2;
}

Luckily, the Vale compiler doesn't just hand out pointers to our objects.

Instead, it gives C an opaque 32-byte handle. The 32 bytes contain:

  • A 16B generational reference to the object, which contains:
    • A pointer to the object.
    • The object's generation.
  • A 16B generational reference to the object's region, which contains:
    • A pointer to the object's region.
    • The region's generation.

These 32 bytes are then scrambled:

  • We xor the object's Type ID into that last part (the region's generation).
  • We xor the entire 32 bytes by a constant factor.
  • We rotate the entire 32 bytes by a constant factor.

These constant factors are randomly generated at compile time.
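To make the three scrambling steps concrete, here is a minimal C sketch. It is only an illustration under assumptions: the Handle layout, the XOR_KEY and ROTATE_BITS values, and the scramble and unscramble names are all hypothetical, and the real compiler would pick its own random constants.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical layout for the 32-byte handle: four 64-bit words.
typedef struct {
    uint64_t object_ptr;  // pointer to the object
    uint64_t object_gen;  // the object's generation
    uint64_t region_ptr;  // pointer to the object's region
    uint64_t region_gen;  // the region's generation
} Handle;

static_assert(sizeof(Handle) == 32, "handle must be exactly 32 bytes");

// Placeholder constants (assumption); the real ones would be
// randomly generated at compile time.
static const uint64_t XOR_KEY[4] = {
    0x9E3779B97F4A7C15ULL, 0xC2B2AE3D27D4EB4FULL,
    0x165667B19E3779F9ULL, 0xD6E8FEB86659FD93ULL,
};
enum { ROTATE_BITS = 77 };  // any value in 1..255

// Rotate a 256-bit value (in[0] is the least significant word) left by `bits`.
static void rotl256(const uint64_t in[4], uint64_t out[4], unsigned bits) {
    unsigned word_shift = (bits / 64) % 4;
    unsigned bit_shift = bits % 64;
    for (unsigned i = 0; i < 4; i++) {
        uint64_t main_word = in[(i + 4 - word_shift) % 4];
        uint64_t carry_word = in[(i + 3 - word_shift) % 4];
        // Guard against shifting by 64, which is undefined in C.
        out[i] = bit_shift == 0
            ? main_word
            : (main_word << bit_shift) | (carry_word >> (64 - bit_shift));
    }
}

Handle scramble(Handle h, uint64_t type_id) {
    uint64_t words[4] = { h.object_ptr, h.object_gen, h.region_ptr, h.region_gen };
    words[3] ^= type_id;                                      // 1. xor the type ID into the region's generation
    for (unsigned i = 0; i < 4; i++) words[i] ^= XOR_KEY[i];  // 2. xor all 32 bytes with a constant
    uint64_t out[4];
    rotl256(words, out, ROTATE_BITS);                         // 3. rotate all 32 bytes by a constant
    return (Handle){ out[0], out[1], out[2], out[3] };
}

// Undoes scramble() by applying the inverse steps in reverse order.
// The caller supplies the type ID it expects the handle to have.
Handle unscramble(Handle s, uint64_t expected_type_id) {
    uint64_t words[4] = { s.object_ptr, s.object_gen, s.region_ptr, s.region_gen };
    uint64_t out[4];
    rotl256(words, out, 256 - ROTATE_BITS);                   // rotate back
    for (unsigned i = 0; i < 4; i++) out[i] ^= XOR_KEY[i];
    out[3] ^= expected_type_id;
    return (Handle){ out[0], out[1], out[2], out[3] };
}
```

In this sketch, unscrambling with the wrong type ID yields a region generation that no longer matches, which is how a type mix-up would surface as a failed generation check.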

Why do we do all this?

This should make it just difficult enough to dissuade anyone looking for a "quick fix" involving accessing Vale objects' data.

How does C read the data then?

The C code will need to hand it back to a Vale function, like the example here.

When a Vale function receives a scrambled reference, it will unscramble it and generation-check the region. If the C code tampered with it at all, it will be detected right then.
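Here is a minimal C sketch of the generation-check half of that step, assuming the reference has already been unscrambled. Everything in it is an assumption for illustration: the Region and RegionRef types, the fixed region table, and the retire_region and region_ref_is_valid functions are hypothetical, not Vale's actual runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical region header: holds the region's current generation.
typedef struct {
    uint64_t generation;
} Region;

// The unscrambled region half of a handle: where the region is,
// and which generation it had when the handle was created.
typedef struct {
    Region* region;
    uint64_t remembered_gen;
} RegionRef;

// Assumption for this sketch: all regions live in one known table.
#define MAX_REGIONS 4
Region regions[MAX_REGIONS];

// Retiring a region bumps its generation, invalidating older references.
void retire_region(size_t index) {
    assert(index < MAX_REGIONS);
    regions[index].generation++;
}

// Returns true only if the reference points at a known region
// and remembers that region's current generation.
bool region_ref_is_valid(RegionRef ref) {
    for (size_t i = 0; i < MAX_REGIONS; i++) {
        if (ref.region == &regions[i]) {
            return regions[i].generation == ref.remembered_gen;
        }
    }
    // Not a pointer to any known region: never dereference it.
    return false;
}
```

Looking the pointer up in a table before reading through it is a choice made for this sketch, so that a tampered pointer is rejected rather than dereferenced. The real runtime may validate it differently.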

Message Passing

As shown in this example, when we send immutable data between C and Vale, we're actually sending a copy.

The C code can do whatever it likes with this copy, and there's no risk of corrupting Vale objects.
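As a rough illustration of why copying is safe, here is a small C sketch. The Vec3 type and the copy_for_c and careless_c_function names are hypothetical stand-ins; the real FFI wrappers would be generated by the compiler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical immutable value owned by the safe side.
typedef struct {
    int x;
    int y;
    int z;
} Vec3;

// Stand-in for a generated wrapper: instead of handing C a pointer
// to the original, it hands over a freshly allocated copy.
// The caller owns the copy and must free() it. Returns NULL on allocation failure.
Vec3* copy_for_c(const Vec3* vale_owned) {
    assert(vale_owned != NULL);
    Vec3* copy = malloc(sizeof *copy);
    if (copy != NULL) {
        memcpy(copy, vale_owned, sizeof *copy);
    }
    return copy;
}

// Stand-in for buggy C code that scribbles over its argument.
void careless_c_function(Vec3* v) {
    v->x = -1;
    v->y = -1;
    v->z = -1;
}
```

Whatever careless_c_function does to the copy, the original Vec3 is untouched, because the C code never had its address.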

Separating Memory

Another reason it's risky to call into an unsafe language is that its code can overrun buffers on the stack, like this C snippet:

void sinisterCFunction() {
  int myArray[10];
  myArray[-5] = 7;
  myArray[15] = 7;
}

This function is particularly sinister, because it will overwrite its caller's memory. Perhaps our caller was this Vale function:

struct Ship {
  engine ^Engine; // heap-allocated Engine
}

func myValeFunction() {
  ship Ship = Ship(^Engine(42));

  // Call the C function
  sinisterCFunction();
}

Here, the C function is reaching up the stack, into the caller's memory, and changing something there. This might make ship.engine point to address 0x7.

To solve this particular problem, we run the C code on a secondary stack.

This involves some inline assembly, which will set the stack pointer to some new memory, and then call a "wrapper" function using that new stack.

asm volatile(
  // Set the stack pointer to new_stack_top.
  "mov %[rs], %%rsp \n"
  // Call sinisterCFunction_wrapper function.
  "call *%[bz] \n"
  : [ rs ] "+r" (new_stack_top), [ bz ] "+r" (sinisterCFunction_wrapper)
  ::
);

As you can see, we call some sort of sinisterCFunction_wrapper on the "new stack".

This wrapper function will call sinisterCFunction, and when it's done, it will jump back to our original stack.

Here is the wrapper:

void sinisterCFunction_wrapper() {
  // Extract args from thread local storage:
  size_t original_stack_state_scrambled =
      thread_local_current_wrapper_args->original_stack_state_scrambled;
  // If sinisterCFunction had any arguments, we'd read them here.

  // Call the actual sinisterCFunction.
  sinisterCFunction();

  // Jump back to the safe stack.
  // This will undo the stack pointer to what it was when we called setjmp.
  // Supplying the 1 will send it into its else block.
  longjmp(*(jmp_buf*)unscramblePtr(original_stack_state_scrambled), 1);
}

longjmp is another way to switch stacks. Here, we're using it to switch back to our original stack. Our "original stack state" was stored scrambled in thread local storage.

Remember that assembly code we saw above? Below we see it in context. This C code sets up the original stack state, puts a pointer to it in thread local storage, and then uses the assembly code to switch to the new stack.

// Set up the original_stack_state, which the other stack will use
// to switch back to here.
jmp_buf original_stack_state;
if (setjmp(original_stack_state) == 0) {
  // Put the return destination into the thread-local "wrapper args".
  SinisterCwrapperFunctionArgs args = {
    scramblePtr(&original_stack_state),
    // If sinisterCFunction had any arguments, they would go here.
  };
  thread_local_current_wrapper_args = &args;

  asm volatile(
    // Set the stack pointer to new_stack_top.
    "mov %[rs], %%rsp \n"
    // Call sinisterCFunction_wrapper function.
    "call *%[bz] \n"
    : [ rs ] "+r" (new_stack_top), [ bz ] "+r" (sinisterCFunction_wrapper)
    ::
  );
} else {
  // Continue on
}
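If you'd like to experiment with the secondary-stack idea without writing assembly, here is a small sketch that swaps in a different technique: the ucontext functions (getcontext, makecontext, swapcontext) from glibc and other POSIX-style systems, rather than the setjmp, longjmp, and inline assembly approach above. The names unsafe_function and run_on_side_stack are hypothetical. The sketch runs a function on a malloc'd stack and checks that the function's local variable really lived there.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

enum { SIDE_STACK_SIZE = 64 * 1024 };

static ucontext_t main_context;  // where to resume on the original stack
static ucontext_t side_context;  // execution state for the secondary stack
static uintptr_t observed_local_address;  // recorded by unsafe_function

// Stand-in for the unsafe C function: records where its locals live.
static void unsafe_function(void) {
    volatile int local = 0;
    observed_local_address = (uintptr_t)&local;
}

// Runs unsafe_function on a freshly allocated secondary stack.
// Returns 1 if its local variable lived inside that stack, 0 if not,
// and -1 if setting up the stack failed.
int run_on_side_stack(void) {
    char* stack = malloc(SIDE_STACK_SIZE);
    if (stack == NULL) return -1;

    if (getcontext(&side_context) != 0) {
        free(stack);
        return -1;
    }
    side_context.uc_stack.ss_sp = stack;
    side_context.uc_stack.ss_size = SIDE_STACK_SIZE;
    // When unsafe_function returns, resume main_context (the original stack).
    side_context.uc_link = &main_context;
    makecontext(&side_context, unsafe_function, 0);

    observed_local_address = 0;
    // Save our state into main_context and switch to the secondary stack.
    if (swapcontext(&main_context, &side_context) != 0) {
        free(stack);
        return -1;
    }

    // We're back on the original stack, and the callee has run.
    assert(observed_local_address != 0);
    uintptr_t low = (uintptr_t)stack;
    uintptr_t high = low + SIDE_STACK_SIZE;
    int inside = observed_local_address >= low && observed_local_address < high;
    free(stack);
    return inside;
}
```

Note that this sketch only separates the stacks. It has no guard pages, so a large enough overrun could still run off the end of the secondary stack into neighboring heap memory.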

Putting it all together

We've described three mechanisms to help protect our Vale objects from accidental bugs in our C code:

  • Message passing
  • Scrambling references
  • Separate stacks

With these, there's no way for C code to accidentally get a pointer to a Vale object and corrupt it.

This is already a huge win. Now, instead of trusting in your dependencies' skill and discipline, you only need to trust their intentions.

Safety with Unsafe Dependencies

However, if someone has whitelisted a library to use unsafe for some reason (despite the warnings they had to get past), those library authors could use a supply chain attack to get some C code into the program which works around these measures to corrupt the Vale data.

There are a couple ways we could theoretically protect a program from even that:

  • Process Isolation: We can run all C code in a separate process.
  • WebAssembly Isolation: We can run all C code in a WebAssembly instance.
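The process isolation option could look roughly like the following C sketch, which uses POSIX fork, pipe, and waitpid. It is an illustration under assumptions, not a design Vale has committed to: call_isolated, untrusted_half, and untrusted_crash are hypothetical names, and a real implementation would also need to restrict what the child process is allowed to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-ins for untrusted C code: one behaves, one crashes.
int untrusted_half(int fuel) { return fuel / 2; }
int untrusted_crash(int fuel) { (void)fuel; abort(); }

// Runs fn(arg) in a child process and copies the result back through a pipe.
// Returns true and sets *result on success. Returns false, leaving *result
// untouched, if the child crashed or anything else went wrong.
bool call_isolated(int (*fn)(int), int arg, int* result) {
    assert(fn != NULL && result != NULL);

    int fds[2];
    if (pipe(fds) != 0) return false;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (pid == 0) {
        // Child: run the untrusted code, then send back a copy of the result.
        close(fds[0]);
        int value = fn(arg);
        ssize_t written = write(fds[1], &value, sizeof value);
        _exit(written == (ssize_t)sizeof value ? 0 : 1);
    }

    // Parent: its memory is unreachable from the child's address space.
    close(fds[1]);
    int value = 0;
    ssize_t got = read(fds[0], &value, sizeof value);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return false;
    if (got != (ssize_t)sizeof value) return false;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return false;

    *result = value;
    return true;
}
```

The trade-off is cost: every call pays for a process and a copy through the pipe, which is why this would only make sense for dependencies we don't trust.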

Unsafe Blocks

We can also use region boundary hardening to support unsafe blocks in Vale. The one difference: instead of fully switching from the safe stack to the unsafe stack, we'd operate on both simultaneously.

Draft Notes and To Do

  • Add a conclusion
  • More details on the unsafe blocks
  • Can we compile wasm to C again? Then it can have checks that all accesses are within a 4gb range. I think this is what is getting at.
  • Mention how this is a big improvement if all parties are acting in good faith. Isolation helps against bad actors.
  • Somehow work in how people don't need unsafe to optimize.
    • They can use the check override !! operator.
    • Also, it's disabled by default.
    • We can enable those, or even automatically have them in the entire module.
    • We can have a canary running with them on, and everyone else running with them off, to detect if there's any shenanigans.
      • Maybe the canary can also run with deterministic replayability?

"This is already a huge win. The vast majority of Vale libraries don't use unsafe operations, and use the standard library instead. So if you haven't whitelisted any dependencies to use unsafe code, and you know that you're not intentionally working around these mechanisms yourself, you can be confident that your data won't be corrupted."

Side Notes
(interesting tangential thoughts)

By native, we mean not running in a VM.