Now that we're wrapping up the Vale 0.2 release, we're turning our sights towards the next step in Vale's ambitious goals: perfect replayability.
Perfect replayability (also known as deterministic replayability) is how we can execute a program twice, and guarantee that the second run will behave exactly as the first, even in the presence of multithreading.
Our goal is to completely eliminate all heisenbugs, so nobody ever has to spend hours trying to reproduce a tricky bug ever again. This page describes our final design, and what we hope to accomplish.
If you're impressed with our track record and believe in the direction we're heading, please consider sponsoring us on github! We can't do this without you, and we appreciate all the support you've shown.
One of the biggest challenges in any debugging session is reproducing the problem. There are so many uncontrollable factors that affect whether a bug happens:
It can be nigh impossible to reproduce certain bugs. These are called "heisenbugs," because they always seem to appear when you aren't looking for them, and then disappear when you try to study them. 0
Often, we spend hours (a "hunting trip", as we say in the biz) trying to reproduce it in a debugger, hoping that we hit a certain breakpoint.
Then, when it hits, we celebrate, and then sober up and put our war faces on. We must step gingerly to make sure we don't hit "continue" in the debugger, thus losing our current state.
We also can't re-run the program, because it took hours to reproduce the bug even this once.
And since we can't re-run the program, we can't add any more printouts! We're stuck, unable to move.
At this point, all we can do is inspect the current state of the program, hoping that there's enough information there, hoping that there are enough printouts, hoping that we can identify the root cause.
And once we have a fix, we need another hunting trip to reproduce the problem again, to see if our fix worked. However, we never quite know if a successful run was successful because of our fix, or because the heisenbug is hiding again.
It would be amazing if instead, we could perfectly reproduce these problems, and then add printouts and even refactor our code, without additional hunting trips.
Sounds like a fantasy, right? How can this be possible?
In short, we eliminate all undefined behavior, remove as many sources of nondeterminism as possible, and record the rest.
Let's start with how we can record inputs!
We can start up our program in "recording mode", where Vale records all data that comes in via FFI, for example:
Any other FFI inputs are also recorded.
Then, when we start the program in "replaying mode", whenever the program attempts to call that FFI function, it will instead read from that recording file.
We can add exceptions; for example, we can instruct it to never skip println. Now we can see all our printouts when we replay!
For a language to have perfect replayability, it can't have any undefined behavior.
This isn't difficult for most languages, unless they offer unsafe capabilities. Vale has three unsafe capabilities:
The check override is disabled by default in libraries, so they're expected to work properly when safety checks are enabled.
We handle the unsafe blocks similarly to how we handled FFI; record their effects on the outside world, so we can replay them. 1
There are some sources of nondeterminism we had to carefully avoid, when designing Vale.
For example, casting a pointer to an integer is nondeterministic in any language, because memory addresses are randomly determined at run-time, because of Address Space Layout Randomization. For now, we've removed that ability completely in Vale. We may add it back sometime in the future, with a promising "deterministic mapping allocator" which compensates for ASLR.
Another source of nondeterminism is uninitialized memory. For example, when we allocate an array and read it out-of-bounds, it's impossible to know what data it will read. So, we made sure there was no way to read uninitialized memory.
Making a program deterministic when there's multi-threading is actually simpler than one might assume. Basically, we:
With the above measures, we find an amazing capability emerges: resilience.
We have "pure resilience", the ability to add pure function calls and printouts to our program, and be able to use the same recording. After all, it calls the same FFI functions in the same order, so why not?
We also have some "impure resilience", the ability to be able to refactor our program to a surprising extent, and still be able to use the same recording. If it calls the same FFI functions in the same order, then we can refactor quite a bit!
In fact, this resilience is what would separate Vale's perfect replayability from existing technologies like rr which just record a single execution, and dont allow modifying the source code.
We've discussed a pretty radical idea, in pretty broad strokes! Those who want to read more on the implementation details are welcome to look at our internal designs for some more details.
We hope you enjoyed this! And if you believe in the direction we're heading, please consider sponsoring us on github!
With your support, we can make this happen, and bring an end to all heisenbugs!
- Evan Ovadia 2
This isn't actually how software bugs work, but it seems like it!
This is actually pretty complex in practice, see our internal notes if you'd like to know more on this.
Draft notes:
The Vale Language Project is not just about making Vale, it's also about exploring, discovering, and publishing new programming language mechanisms that enable speed, safety, and ease of use.
The world needs more exploration here! Currently, most programming language research is in:
These are useful, but there is a vast field of possibilities in between, waiting to be explored!
Our aim is to explore that space, discover what it has to offer, and make speed and safety easier than ever before.
In this quest, we've discovered a lot of new techniques:
These techniques have also opened up some new emergent possibilities:
We also gain a lot of inspiration from other languages, and are finding new ways to combine their techniques:
...plus a lot more interesting ideas to explore!
The Vale programming language is only one combination of the features we've found. Our goal is to publish all the techniques we've found, even the ones that couldn't fit in Vale, so that other languages can make strides in this area.
Our medium-term goals:
We aim to publish articles biweekly on all of these topics, and inspire the next generation of fast, safe, and easy programming languages.
If you want to support our work, please consider sponsoring us on GitHub!
With enough sponsorship, we can: