LanguagesArchitecture

Now that we're wrapping up the Vale 0.2 release, we're turning our sights towards the next step in Vale's ambitious goals: perfect replayability.

Perfect replayability (also known as deterministic replayability) is how we can execute a program twice, and guarantee that the second run will behave exactly as the first, even in the presence of multithreading.

Our goal is to completely eliminate all heisenbugs, so nobody ever has to spend hours trying to reproduce a tricky bug ever again. This page describes our final design, and what we hope to accomplish.

If you're impressed with our track record and believe in the direction we're heading, please consider sponsoring us on github! We can't do this without you, and we appreciate all the support you've shown.

What and Why

One of the biggest challenges in any debugging session is reproducing the problem. There are so many uncontrollable factors that affect whether a bug happens:

  • Network latency
  • Thread scheduling
  • Time of day
  • Random number generators
  • User input
  • Animation delays

It can be nigh impossible to reproduce certain bugs. These are called "heisenbugs," because they always seem to appear when you aren't looking for them, and then disappear when you try to study them. 0

Often, we spend hours (a "hunting trip", as we say in the biz) trying to reproduce it in a debugger, hoping that we hit a certain breakpoint.

Then, when it hits, we celebrate, and then sober up and put our war faces on. We must step gingerly to make sure we don't hit "continue" in the debugger, thus losing our current state.

We also can't re-run the program, because it took hours to reproduce the bug even this once.

And since we can't re-run the program, we can't add any more printouts! We're stuck, unable to move.

At this point, all we can do is inspect the current state of the program, hoping that there's enough information there, hoping that there are enough printouts, hoping that we can identify the root cause.

And once we have a fix, we need another hunting trip to reproduce the problem again, to see if our fix worked. However, we never quite know if a successful run was successful because of our fix, or because the heisenbug is hiding again.

It would be amazing if instead, we could perfectly reproduce these problems, and then add printouts and even refactor our code, without additional hunting trips.

Sounds like a fantasy, right? How can this be possible?

Overview

In short, we eliminate all undefined behavior, remove as many sources of nondeterminism as possible, and record the rest.

Let's start with how we can record inputs!

Recording Inputs

We can start up our program in "recording mode", where Vale records all data that comes in via FFI, for example:

  • Any data coming in from the network.
  • Any timestamps such as from GetTime.
  • Any data coming from standard in.

Any other FFI inputs are also recorded.

Then, when we start the program in "replaying mode", whenever the program attempts to call that FFI function, it will instead read from that recording file.

We can add exceptions; for example, we can instruct it to never skip println. Now we can see all our printouts when we replay!

Undefined Behavior

For a language to have perfect replayability, it can't have any undefined behavior.

This isn't difficult for most languages, unless they offer unsafe capabilities. Vale has three unsafe capabilities:

  • The "check override" operator !! which skips a generation check if it wasn't already elided by the Vale compiler.
  • FFI functions, covered above.
  • The unsafe block, as described in The Unsafe Block.

The check override is disabled by default in libraries, so they're expected to work properly when safety checks are enabled.

We handle the unsafe blocks similarly to how we handled FFI; record their effects on the outside world, so we can replay them. 1

Remove Nondeterminism

There are some sources of nondeterminism we had to carefully avoid, when designing Vale.

For example, casting a pointer to an integer is nondeterministic in any language, because memory addresses are randomly determined at run-time, because of Address Space Layout Randomization. For now, we've removed that ability completely in Vale. We may add it back sometime in the future, with a promising "deterministic mapping allocator" which compensates for ASLR.

Another source of nondeterminism is uninitialized memory. For example, when we allocate an array and read it out-of-bounds, it's impossible to know what data it will read. So, we made sure there was no way to read uninitialized memory.

Handle Multi-threading

Making a program deterministic when there's multi-threading is actually simpler than one might assume. Basically, we:

  • Make a recording file for each OS thread (not each green thread, which are already deterministically scheduled).
  • Record the "message ID" of every message it receives through a channel.
  • Record the "mutex lock count" of every mutex it opens.

Resilience

With the above measures, we find an amazing capability emerges: resilience.

We have "pure resilience", the ability to add pure function calls and printouts to our program, and be able to use the same recording. After all, it calls the same FFI functions in the same order, so why not?

We also have some "impure resilience", the ability to be able to refactor our program to a surprising extent, and still be able to use the same recording. If it calls the same FFI functions in the same order, then we can refactor quite a bit!

In fact, this resilience is what would separate Vale's perfect replayability from existing technologies like rr which just record a single execution, and dont allow modifying the source code.

Conclusion

We've discussed a pretty radical idea, in pretty broad strokes! Those who want to read more on the implementation details are welcome to look at our internal designs for some more details.

We hope you enjoyed this! And if you believe in the direction we're heading, please consider sponsoring us on github!

With your support, we can make this happen, and bring an end to all heisenbugs!

- Evan Ovadia 2

Side Notes
(interesting tangential thoughts)
0

This isn't actually how software bugs work, but it seems like it!

1

This is actually pretty complex in practice, see our internal notes if you'd like to know more on this.

2

Draft notes:

  • Add more to the multi-threading section, on replaying.
  • Split up the What and Why section somehow, its very long.
  • Define nondeterminism?

About the Vale Language Project

The Vale Language Project is not just about making Vale, it's also about exploring, discovering, and publishing new programming language mechanisms that enable speed, safety, and ease of use.

The world needs more exploration here! Currently, most programming language research is in:

  • High-overhead languages involving reference counting and tracing garbage collection.
  • Complex languages (Ada/Spark, Coq, Rust, Haskell, etc.) which impose a higher complexity burden on the average programmer.

These are useful, but there is a vast field of possibilities in between, waiting to be explored!

Our aim is to explore that space, discover what it has to offer, and make speed and safety easier than ever before.

In this quest, we've discovered a lot of new techniques:

These techniques have also opened up some new emergent possibilities:

  • Seamless concurrency, the ability to launch multiple threads that can access any pre-existing data without data races, without the need for refactoring the code or the data.
  • Object pools and bump-allocators that are memory-safe and decoupled, so no refactoring needed.
  • Separated FFI, which allows us to call into C without risk of accidentally corrupting Vale objects.
  • Deterministic replayability, to record all inputs and replay execution. Goodbye races and heisenbugs!
  • Higher RAII, a form of linear typing that enables destructors with parameters and returns.

We also gain a lot of inspiration from other languages, and are finding new ways to combine their techniques:

  • We can mix an unsafe block with Separated FFI to make a much safer systems programming language!
  • We can mix Erlang's isolation benefits with functional reactive programming to make much more resilient programs!
  • We can mix region borrow checking with Pony's iso to support shared mutability.

...plus a lot more interesting ideas to explore!

The Vale programming language is only one combination of the features we've found. Our goal is to publish all the techniques we've found, even the ones that couldn't fit in Vale, so that other languages can make strides in this area.

Our medium-term goals:

  • Publish the Language Simplicity Manifesto, a collection of principles to keep programming languages' learning curves down.
  • Publish the Memory Safety Grimoire, a collection of "memory safety building blocks" that languages can potentially use to make new memory models, just like Vale combined generational references and scope tethering.
  • Prototype the Region Borrow Checker in Vale, to show the world that shared mutability can work with borrow checking!
  • Prototype Hybrid-Generational Memory in Vale, to see how fast and easy we can make single ownership.

We aim to publish articles biweekly on all of these topics, and inspire the next generation of fast, safe, and easy programming languages.

If you want to support our work, please consider sponsoring us on GitHub!

With enough sponsorship, we can:

  • Work on this full-time.
  • Turn the Vale Language Project into a 501(c)(3) non-profit organization.
  • Make Vale into a production-ready language, and push it into the mainstream!