Heisenbugs: The most elusive kind of bug, and how to capture them with Perfect Replayability

Eliminate heisenbugs and endless debugging sessions!

Jun 29, 2022 — Evan Ovadia

The Vale 0.2 release is out, and it includes a prototype for one of Vale's most productive features: perfect replayability.

Perfect replayability is a language feature where we can execute a program twice, and guarantee that the second run will behave exactly as the first, even in the presence of multithreading. 0 1

With this feature, we eliminate all heisenbugs, so nobody ever has to spend hours trying to reproduce a tricky bug ever again.

It also makes our programs completely deterministic, which is invaluable for many kinds of networked applications, especially multiplayer games.

By using this technique, we were able to finish making Incendian Falls in time and successfully complete the 7DRL challenge.

The prototype was just merged in, you can play with this feature today.

What and Why

One of the biggest challenges in any debugging session is reproducing the problem. There are so many uncontrollable factors that affect whether a bug happens:

Network latency
Thread scheduling
Time of day
Random number generators
User input
Animation delays

It can be nigh impossible to reproduce certain bugs. These are called "heisenbugs," because they always seem to appear when you aren't looking for them, and then disappear when you try to study them. 2

Often, we spend hours (a "hunting trip", as we say in the biz) trying to reproduce it in a debugger, hoping that we hit a certain breakpoint.

Then, when it hits, we celebrate! And then we sober up and put our detective hats on. We must step gingerly to make sure we don't hit "continue" in the debugger, thus losing our current state.

We also can't re-run the program, because it took hours to reproduce the bug even this once.

And since we can't re-run the program, we can't add any more printouts! We're stuck, unable to move.

At this point, all we can do is inspect the current state of the program, hoping that there's enough information there, hoping that there are enough printouts, hoping that we can identify the root cause.

And once we have a fix, we need another hunting trip to reproduce the problem again, to see if our fix worked. However, we never quite know if a successful run was successful because of our fix, or because the heisenbug is hiding again.

It would be amazing if instead, we could perfectly reproduce these problems, and then add printouts and even refactor our code, without additional hunting trips.

Sounds like a fantasy, right? How can this be possible?

Side Notes

(interesting tangential thoughts)

Notes [–] Notes [+] 0 1 2

This is pretty difficult to achieve in most languages, because they have undefined behavior, unsafe operations, and so on. Some languages even expose nondeterminism in their standard library (C#'s string.GetHashCode(), Java's WeakReference) with no way to make them deterministic.

This is just a prototype, and doesn't support multithreading yet. Read below for how we'll make it possible.

This isn't actually how software bugs work, but it seems like it!

Overview

In short, we eliminate all undefined behavior, remove as many sources of nondeterminism 3 as possible, and record the rest.

Let's start with how we can record inputs!

Recording Inputs

We can start up our program in "recording mode", where Vale records all data that comes in via FFI, for example:

Any data coming in from the network.
Any data coming in from a file.
Any timestamps such as from GetTime.
Any data coming from standard in.

Any other FFI inputs are also recorded.

Then, when we start the program in "replaying mode", whenever the program attempts to call that FFI function, it will instead read from that recording file.

Undefined Behavior

For a language to have perfect replayability, it can't have any undefined behavior.

Most languages can't guarantee zero undefined behavior, but we achieved it in Vale. This is because of Vale's complete memory safety 4 and its Fearless FFI, which isolates all safe data from unsafe data using the FFI boundary.

See Fearless FFI: Memory Safety, Safer Dependencies, and Supply-Chain Attack Mitigation for more! 5

Remove Nondeterminism

There are some sources of nondeterminism we had to carefully avoid, when designing Vale.

For example, casting a pointer to an integer is nondeterministic in any language, because memory addresses are randomly determined at run-time, because of Address Space Layout Randomization. For now, we've removed that kind of casting in Vale. We may add it back sometime in the future, with a promising "deterministic mapping allocator" which compensates for ASLR.

Another source of nondeterminism is uninitialized memory. For example, when we allocate an array and read it out-of-bounds, it's impossible to know what data it will read. Luckily, Vale's generational references guarantee we can't read uninitialized memory.

Handle Multi-threading

Making a program deterministic when there's multi-threading is actually simpler than one might assume. Basically, we:

Make a recording file for each OS thread (not each virtual thread, which are already deterministically scheduled).
Record the "message ID" of every message it receives through a channel.
Record the "mutex lock count" of every mutex it opens.

If you'd like more details on how that works, see our internal notes, and feel free to swing by the discord server if you have any questions!

Note that multi-threading is not fully implemented yet, we mention it to show the direction we're heading.

Resilience

With the above measures, we find an amazing capability emerges: resilience.

We have "pure resilience", the ability to add pure function calls and printouts to our program, and be able to use the same recording. After all, it calls the same FFI functions in the same order, so why not?

We also have some "impure resilience", the ability to be able to refactor our program to a surprising extent, and still be able to use the same recording. If it calls the same FFI functions in the same order, then we can refactor quite a bit!

In fact, this resilience is what would separate Vale's perfect replayability from existing technologies like rr which just record a single execution, and dont allow modifying the source code.

Perfect Replayability + Random Testing

When making Incendian Falls for the 7DRL challenge, we had to move fast; we had to implement an entire game in only one week.

We used deterministic replayability 6 to help us find five bugs we wouldn't have found otherwise.

To do this, we:

Made a primitive invincible AI that would make the player take random actions each turn.
Ran the game thousands of times, overnight.

In the morning, we found that a lot of games had crashed. We replayed each one in the debugger, and reproduced the crash instantly.

We did this for each run, and root-caused the problems to five bugs.

If not for this technique, we might not have finished in time.

Perfect Replayability + Lock-Step Simulation

A lot of games already require perfect replayability. For example, most real-time strategy games require deterministic lock-step simulation which means:

Triggering zero undefined behavior.
Recording all inputs (and sharing them with other machines).
Removing all sources of non-determinism.

Vale has determinism built into the language itself, so Vale could be an amazing choice for multiplayer games.

Try it out!

You can use replayability today, by passing --enable_replaying true to the valec invocation.

Then, run your program with --vale_record recording.bin to record an execution to recording.bin.

After that, you can run your program again with --vale_replay recording.bin to make it run the exact same way. (We recommend running this in a debugger if you want to break on an error.)

Note that this is just a prototype. Some things don't work yet, such as passing mutable references over the FFI boundary. Feel free to report any bugs to the GitHub repo!

Limitations

There are a few limitations to this approach.

This can only replay Vale code. Because a replay doesn't actually call the FFI functions (it intercepts them and reads from a file instead) we won't reproduce a crash that happens in C code, for example.
Because we don't actually call the FFI functions during a replay, that means we likely won't see some output on the screen. For this, we plan to enable whitelisting:

The user could whitelist functions like println so they can see printouts.
The user could whitelist certain libraries that play will with replayability.

There is a slight slowdown, as we serialize all inputs from FFI. For some programs, this is negligible. For programs that communicate a lot with C, the overhead could be significant.

Still, for many cases, this is a small price to pay to completely eliminate heisenbugs and weeks-long investigative debugging sessions.

Conclusion

With perfect replayability, we made it so the language itself is not a source of nondeterminism, which allows us to be a lot more productive and spend a lot less time debugging.

We described this feature in pretty broad strokes! Those who want to read more on the implementation details are welcome to look at our internal designs for some more details.

In the coming weeks, we'll be writing more about "immutable calling" which helps eliminate memory safety overhead, so subscribe to our RSS feed, twitter, or the r/Vale subreddit, and come hang out in the Vale discord!

Notes [–] Notes [+] 3 4 5 6

Nondeterminism is like unpredictability, it's a source of data that's different every run. For example, a function that returns the current time will be different every run.

Because of Vale's generational references, it can offer memory safety without unsafe escape hatches and without garbage collection.

We might even have a way to offer unsafe blocks and still maintain perfect replayability, see our internal notes if you'd like to know more on this.

At the time, we actually used a subset of C#. Since C# doesn't have perfect replayability, it was tricky to figure out what libraries let nondeterminism creep in.