In this article, I'll explain how we can add multi-threading to our code with a single keyword, with no risk of data races!
TL;DR: One parallel keyword, combined with read-only regions, gives us structured concurrency with no data races.
Note that this is a theoretical feature, not yet implemented; this article is just to give a hint at the direction Vale is going.
If you've never used it, believe me when I say that structured concurrency is a lot of fun. 0
With structured concurrency, you can launch multiple threads at once, to run your calculations in parallel, for a big speedup. And often, it can be done with a tiny adjustment.
Imagine we had this simple program that calculated x^3 for some values of x:
#include <math.h>
#include <stdio.h>

int main() {
  int exponent = 3;
  int results[5];

  // Calculate some powers
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent);
  }

  // Print out the results
  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}
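If we compile and run it (for example with gcc powers.c -lm, assuming we saved it as powers.c), it prints: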
0 to the 3'th power is 0!
1 to the 3'th power is 1!
2 to the 3'th power is 8!
3 to the 3'th power is 27!
4 to the 3'th power is 64!
If pow were much more expensive, we might want to run those calculations in parallel. 1
It can be pretty tedious to add threading to C code. We'd have to wrap our results[i] = pow(i, exponent); line in an entire new function, and use pthread_create, like so:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct {
  int exponent;
  int i;
  int* results;
} MyThreadArgs;

void* my_thread_main(void* args_raw) {
  MyThreadArgs* args = (MyThreadArgs*)args_raw;
  int exponent = args->exponent;
  int i = args->i;
  int* results = args->results;
  results[i] = pow(i, exponent); // Some expensive calculation
  free(args_raw);
  return NULL;
}

int main() {
  int exponent = 3;
  int results[5];

  // Don't run each iteration one after the other...
  // run them in parallel, rather than serially.
  pthread_t threads[5];
  for (int i = 0; i < 5; i++) {
    MyThreadArgs* args = (MyThreadArgs*)malloc(sizeof(MyThreadArgs));
    args->exponent = exponent;
    args->i = i;
    args->results = results;
    pthread_create(&threads[i], NULL, my_thread_main, args);
  }

  // Join the threads
  for (int i = 0; i < 5; i++) {
    void* returnval = NULL;
    pthread_join(threads[i], &returnval);
  }

  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}
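(To build it, we also link the pthread and math libraries: for example, gcc powers_pthread.c -pthread -lm.)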
That's a lot of code to just run results[i] = pow(i, exponent); in parallel!
With structured concurrency, we can do that with just one line, using OpenMP. Let's add a #pragma omp parallel for to our original program:
#include <math.h>
#include <stdio.h>
#include <omp.h>

int main() {
  int exponent = 3;
  int results[5];

  // Launch some threads and run in parallel!
  #pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent); // some expensive calculation
  }

  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}
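(OpenMP just needs to be enabled at compile time: for example, gcc powers_omp.c -fopenmp -lm.)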
This is structured concurrency. It runs our iterations in parallel, and makes sure the parallel iterations are complete before continuing on.
Nathaniel Smith makes a great case that we should use structured concurrency rather than directly using thread APIs such as pthread_create, go, asyncio.create_task, etc. 2 Structured concurrency isn't a silver bullet, of course, but it's definitely a big step forward.
My mind was blown the first time I used it with OpenMP. I had no idea that parallelism could be so easy!
And years later, I used CUDA to make a raytracer, which was similarly mind-boggling. They even appear to have lambda support now, which gives us something close to structured concurrency on the GPU. Incredible!
We only use 5 ints and the pow function as a simple example. In practice, we use threads for much larger data sets.
In fact, this toy example will probably be a lot slower, because threads have their own performance overhead, and because of false sharing. Threads tend to pay off for much larger data sets.
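As an aside, a common mitigation for false sharing is to pad each slot out to its own cache line, so parallel writers don't invalidate each other's lines. Here's a sketch; the PaddedResult type and the 64-byte line size are my assumptions, not from the article:

#include <math.h>
#include <stdio.h>

// Assume 64-byte cache lines; give each slot a whole line to itself.
typedef struct {
  int value;
  char padding[64 - sizeof(int)];
} PaddedResult;

int main() {
  int exponent = 3;
  PaddedResult results[5];

  #pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    results[i].value = pow(i, exponent); // each write lands on its own cache line
  }

  for (int i = 0; i < 5; i++) {
    printf("%d\n", results[i].value);
  }
  return 0;
}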
OpenMP is an amazing tool for C structured concurrency because it's seamless: we didn't have to refactor anything, and the loop can still read the data around it, like exponent. In other words, seamless concurrency is the ability to read existing data concurrently without refactoring existing code.
This seamlessness is important because it saves us time and lets us more easily experiment with adding concurrency throughout our program.
Some other implementations of concurrency aren't seamless, but they have other benefits. For example, fearless concurrency is immune to data race bugs. Read on to find out what that means, and how we can prevent them!
Concurrency is "fearless" if data races are impossible. So what are data races, and why do we want to avoid them?
A data race is when: 3
- two or more threads concurrently access a location of memory,
- one or more of them is a write, and
- one or more of them is unsynchronized.
Let's add a "progress" counter to our above snippet, and see a data race bug in action:
#include <math.h>
#include <stdio.h>
#include <omp.h>

int main() {
  int exponent = 3;
  int results[5];
  int numIterationsCompleted = 0;

  // Launch some threads and run in parallel!
  #pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent); // some expensive calculation

    // Read, add, and store
    numIterationsCompleted = numIterationsCompleted + 1;

    printf("Completed %d iterations!\n", numIterationsCompleted);
  }

  for (int i = 0; i < 5; i++) {
    printf("%d to the %d'th power is %d!\n", i, exponent, results[i]);
  }
  return 0;
}
Completed 1 iterations!
Completed 2 iterations!
Completed 3 iterations!
Completed 4 iterations!
Completed 4 iterations!
0 to the 3'th power is 0!
1 to the 3'th power is 1!
2 to the 3'th power is 8!
3 to the 3'th power is 27!
4 to the 3'th power is 64!
Look at that! Completed 4 iterations! was printed twice. 5 Why is that?
It's because the line numIterationsCompleted = numIterationsCompleted + 1; has three steps:
1. Read the current value of numIterationsCompleted from memory.
2. Add 1 to it.
3. Store the result back into numIterationsCompleted.
If two threads are running in parallel, we might see this happen:
1. The 4th thread reads numIterationsCompleted, and sees 3.
2. The 5th thread reads numIterationsCompleted, and also sees 3.
3. The 4th thread adds 1, and stores 4.
4. The 5th thread adds 1, and also stores 4.
The problem is that they didn't see each other's add operations; they both added 1 to the old value 3, to get 4.
If instead the 4th thread's store had happened before the 5th thread's read, we'd have gotten the correct answer, 5.
This is a data race bug. They are very difficult to detect, because they depend on the scheduling of the threads, which is effectively random.
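In practice, we often hunt for them with tools like ThreadSanitizer (gcc and clang's -fsanitize=thread), but a sanitizer can only flag races on executions that actually exercise them.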
Luckily, we've figured out how to avoid these problems, with some "concurrency ground rules":
- Any number of threads can read the same data, as long as none of them can modify it.
- If a thread can modify some data, then no other thread may read or modify that data at the same time.
- If threads really must share data that changes, they access it with synchronization, such as atomics or mutexes.
We usually need proper fear, respect, and discipline to stick to these rules. But when a language enforces these rules, we have fearless concurrency.
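To make that last rule concrete, here's one way we might fix our progress counter with standard OpenMP; this fix is a sketch of mine, not from the article. The atomic capture pragma makes the read, add, and store one indivisible step:

#include <math.h>
#include <stdio.h>
#include <omp.h>

int main() {
  int exponent = 3;
  int results[5];
  int numIterationsCompleted = 0;

  #pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    results[i] = pow(i, exponent); // some expensive calculation

    int progress;
    // Atomically increment and capture the new value; no other thread
    // can sneak in between the read, the add, and the store.
    #pragma omp atomic capture
    progress = ++numIterationsCompleted;

    printf("Completed %d iterations!\n", progress);
  }
  return 0;
}

Now each thread captures a unique count, so Completed 5 iterations! always appears exactly once.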
From The Rustonomicon.
I include this "[and tasks]" because concurrency can have data races too. Python uses concurrency (not parallelism) and can have data races, such as in this example.
If this doesn't happen for you, try running it a few more times. It's rather fickle.
Plenty of languages 6 offer fearless concurrency, but we'll look at Pony and Rust here.
Pony doesn't let us access data from the outside scope the way our C examples above did. Instead, we "send" data between "actors", which are similar to threads.
We can send data if either of these is true:
- Nobody can modify it; in Pony, deeply immutable objects have the val permission.
- We're giving away the only reference to it; in Pony, such isolated objects have the iso permission.
If we have a val or iso reference to an object, we can send it to another actor.
Pony has fearless concurrency because in this system, data races are impossible; we can never have a modifiable chunk of data that's visible to multiple threads.
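Notice that our pthread example already did something like sending an iso: main malloc'd a MyThreadArgs, handed its only pointer to the new thread, and never touched it again.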
Key takeaways from Pony:
- Multiple threads can safely share an object if nobody can modify it.
- A thread can safely send an object to another thread if it gives away its only reference.
We'll use these techniques below!
See Renato Athaydes' Fearless concurrency: how Clojure, Rust, Pony, Erlang and Dart let you achieve that.
A "permission" is something that travels along with a pointer at compile time, which affects how you may use that pointer. For example, C++'s const or the mut in Rust's &mut
This is also true of Clojure. It's sometimes true with Rust; only some immutable borrows can be shared with multiple threads.
Rust has fearless concurrency that feels a lot like Pony's:
- We can send an object to another thread if we give away our only reference, by moving ownership.
- Multiple threads can share an object if none of them can modify it, via immutable borrows (of Sync data).
And because Rust has async blocks and closures, it can also sometimes 9 do structured concurrency!
use async_scoped::*;

#[async_std::main]
async fn main() {
    Scope::scope_and_block(|s| {
        for _ in 0..5 {
            s.spawn((|| async {
                println!("Running task!");
            })());
        }
    });
}
Running task!
Running task!
Running task!
Running task!
Running task!
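(To run this yourself, you'd add the async-scoped crate and async-std to Cargo.toml, enabling async-std's "attributes" feature for the #[async_std::main] macro.)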
But alas, Rust doesn't have seamless structured concurrency, because it often can't access variables defined outside the task's scope. 10
There are some outstanding issues which prevent this from being generally usable, like sometimes requiring 'static lifetimes, blocking the running executor, or running unsafe code. Some very smart folks are working on these though, see tokio/1879 and tokio/2596.
Specifically, tasks can only capture things from the parent scope if those things are Sync, see this example. Making data Sync is often not possible without potentially extensive rearchitecting of the containing program, so we can't quite say it's seamless. It's still pretty cool though!
We can make fearless and seamless structured concurrency by:
1. Letting threads read anything created outside the parallel block.
2. Never modifying anything created outside the parallel block.
3. Only modifying things created inside the parallel block.
Our C program almost follows these rules, except it violates rule #2; remember how we modified the results array inside the parallel block:
int exponent = 3;
int results[5];

// Launch some threads and run in parallel!
#pragma omp parallel for
for (int i = 0; i < 5; i++) {
  results[i] = pow(i, exponent);
}
But Vale has a foreach loop that can accumulate each iteration's result, and produce an array: 11
exported func main() {
  exponent = 3;
  results =
    foreach i in range(0, 5) {
      pow(i, exponent) 12
    };
  println(results);
}
There! The loop doesn't modify anything created outside the block.
We can now add a theoretical parallel keyword. 13
exported func main() {
  exponent = 3;
  results =
    parallel foreach i in range(0, 5) {
      pow(i, exponent)
    };
  println(results);
}
The parallel keyword will:
- Launch threads to run the loop's iterations in parallel.
- Make all data created outside the block temporarily read-only.
- Wait until every iteration finishes, then gather their results into an array.
Since no threads modify the data, the data is truly temporarily immutable.
Since the data is immutable, the threads can safely share it, as we saw in Pony.
We are now thread-safe!
And just because we can, let's modify something created inside the block:
exported func main() {
  exponent = 3;
  results =
    parallel foreach i in range(0, 5) {
      a = pow(i, exponent);
      set a = a + 4; 15 // We can modify!
      a
    };
  println(results);
}
You can see how the compiler can enforce that we only modify things that came from inside the block.
If we call a function, it needs to know what it can modify and what it can't. For that, we use r'. Note how in this snippet, blur's first parameter's type is &r'[][]int. 16
exported func main() {
  // Makes a 2D array of ints
  input = [](10, x => [](10, y => x * 10 + y));
  results =
    parallel foreach x in range(0, 10) {
      parallel foreach y in range(0, 10) {
        blur(input, x, y)
      }
    };
  println(results);
}

func blur(input &r'[][]int, x int, y int) int {
  ... // Loop over the 3x3 ints around x,y in input and return the average
}
That r' means we're pointing into a read-only region. More on that below!
Key takeaways:
- A foreach loop can produce each iteration's result into an array, so nothing inside the block needs to modify outside data.
- A parallel block can treat everything created outside it as temporarily immutable, so threads can safely share it.
- Functions like blur use r' to declare that they only read from that outside, read-only region.
Vale is a language we're making to explore techniques for easy, fast, and safe code.
We actually discovered this "seamless, fearless, structured concurrency" while playing around with Vale's region borrow checker, which blends mutable aliasing with Rust's borrow checker.
The lack of a ; here means this line produces the block's result, similar to Scala and Rust.
To reiterate, this feature is theoretical, and we're still adding it to Vale. Stay tuned!
The parallel block has a default executor, but it can be overridden with parallel(num_threads) or parallel(an_executor).
set a = a + 4; is like a = a + 4; in C.
This is similar to Rust's lifetimes, more on that in the afterword.
As stated above, to enable fearless structured concurrency, we need to express that a reference points into a read-only region. 17 We can think of a region as a scope.
When we have a reference to an object in a read-only region, we:
- can read the object, but can't modify it.
- get another read-only-region reference when we load one of its members or elements; the read-only view goes all the way down.
We're using r' to specify that a reference is in a read-only region. For example, an &r'[]int is a non-owning reference to an array of integers inside a read-only region.
We can use these for our concurrency, with the following rule: a parallel block sees everything created outside itself as being inside a read-only region.
Elaborating on the r' a little more: it names the region the reference points into, and it travels along with the reference at compile time, like a permission. The compiler tracks which region every reference belongs to, so a function can work with references into several different regions at once.
This may sound familiar from C++ 18 or Rust 19 20, but they don't have anything quite like it.
In Vale, regions can be mutable or immutable, but we "see" them as read-write or read-only. We can see both immutable and mutable regions as read-only. (This is similar to * and const * in C++.)
The user doesn't need to know about this distinction; if they ignore regions altogether the program's behavior will still make sense and be predictable.
A const * is similar, but we might not get a const * when we load a member/element from it.
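To see that footnote in action, here's a small sketch of my own (the Player type is hypothetical, not from the article). C's const applies to the struct we point at, but not to data reachable through its pointer members:

#include <stdio.h>

typedef struct {
  int* scores; // a plain pointer stored inside the struct
} Player;

void inspect(const Player* p) {
  // p->scores = NULL;  // error: p points at a const Player
  p->scores[0] = 42;    // compiles fine: loading the member gives a plain int*
}

int main() {
  int scores[1] = { 7 };
  Player player = { scores };
  inspect(&player);
  printf("%d\n", scores[0]); // prints 42; const didn't protect the pointed-to data
  return 0;
}

A read-only region, by contrast, stays read-only through every member and element we load.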
An immutable borrow (&) is similar, except those can't always be shared with multiple threads; only objects that implement Sync can be.
The afterword goes more into the difference between Vale's regions and Rust's lifetimes.
There are a couple drawbacks:
- Any data we share with a parallel block must be truly immutable for the duration of the block.
- We have to structure our code so that each iteration only modifies data it created itself.
As we saw, we can combine...
- structured concurrency, from OpenMP,
- fearless concurrency's immutable sharing, from Pony and Rust, and
- read-only regions, from Vale
...into a new and interesting fearless, seamless, structured concurrency feature.
And in Part 2, we'll enable fearlessly sharing mutable data too, using channels, isolated subgraphs, "region-shifting", and mutexes. Stay tuned!
Thanks for reading! If you want to discuss more, come by the r/Vale subreddit or the Vale discord, and if you enjoy these articles or want to support our work on Vale, you're welcome to sponsor us on GitHub!
This means no "immutability escape hatches" such as RefCell; immutable must really mean immutable.
Though, this might be good practice anyway.
Vale's regions might remind us of Rust's lifetimes, and the r' seems to hint that they're similar! Let's take another look:
func blur(input &r'[][]int, x int, y int) int {
  ... // Loop over the 3x3 ints around x,y in input and return the average
}
There are some similarities:
- Both are annotations that travel along with a reference at compile time.
- Both let the compiler restrict what we can do through a reference, with no runtime cost.
The biggest difference is that Vale lifts mutability to the region level: instead of tracking whether each individual reference can modify its object (like Rust's & and &mut), Vale tracks whether we can modify an entire region at a time.
The compiler enforces that nobody changes objects in immutable regions by tracking which region every reference points into, and rejecting any write through a reference into a region that's currently seen as read-only.
Another big difference is that Vale has no aliasing restrictions; unlike Rust, it lets us keep multiple references to an object, even one we can modify.
Because of these design decisions, Vale's regions are largely "opt-in": a program can ignore regions entirely and still work as expected, and we only need to add annotations like r' when we want to unlock concurrency and optimizations.
We hope that these things will give Vale a gradual learning curve, and make life easy for newcomers.
This is a pretty high-level overview of the differences; feel free to swing by our discord and we'd be happy to explain it!
Though, we can have structs in one temporary region pointing into structs in another region, like 'r MyStruct<'b>.
A function that accepts a read-only region will actually be generated twice: once with the assumption that the read-only region is immutable (so it can take advantage of immutability optimizations), and once without that assumption.
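As a rough analogy in C (hypothetical; this is not Vale's actual output), the immutable specialization can do things the merely-read-only one can't, like reading a length once instead of on every iteration:

// Specialization 1: the region is only *seen* as read-only; someone else
// might still mutate it mid-call, so *len must be re-read every iteration.
int sum_readonly(const int* arr, const int* len) {
  int total = 0;
  for (int i = 0; i < *len; i++) {
    total += arr[i];
  }
  return total;
}

// Specialization 2: the region is known immutable for the whole call,
// so the length can be loaded once and reused.
int sum_immutable(const int* arr, const int* len) {
  int n = *len; // safe to read once: nothing can change during the call
  int total = 0;
  for (int i = 0; i < n; i++) {
    total += arr[i];
  }
  return total;
}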
This is because Vale gets its memory safety from Generational References, and soon possibly Hybrid-Generational Memory.