Every once in a while, you come across an optimization so mind-blowingly weird that you feel like you're peering into some sort of alternate dimension.
Until recently, I believed that metaprogramming was basically just normal C++ templates with a few neat features like constexpr thrown on top. I thought it was just "writing code that writes code", like we've been doing with PHP since the dawn of time. Simple stuff, right?
If you believe that like I did, buckle up! I'm going to show you how metaprogramming can be used to unlock what I call the impossible optimization.
And I don't just mean little 5% optimizations. I'm talking about crazy optimizations that make your code ten times faster.
Imagine that you had a program using a simple regex function, and you wanted to use it to verify emails [0]:
fn main():
    var email_regex = Regex(
        "\w+(\+\w*)?@(\d+\.\d+\.\d+\.\d+|\w+\.\w+)"
    )
    print(matches(email_regex, "user@example.com"))  # Prints True
    print(matches(email_regex, "uexample.com"))  # Prints False
    print(matches(email_regex, "user@ecom"))  # Prints False
    print(matches(email_regex, "user+tag@example.com"))  # Prints True
    print(matches(email_regex, "user@100"))  # Prints False
    print(matches(email_regex, "howdy123@1.2.3.4"))  # Prints True
    print(matches(email_regex, "howdy1231.2.3.4"))  # Prints False
    print(matches(email_regex, "howdy123@1/2/3/4"))  # Prints False
We all know that regex is pretty slow, because it basically has to run a whole interpreter. Or, if we "precompile" a Regex, it would be interpreting an AST (or a DFA), which is still pretty slow. [1]
When you really need speed, you skip regex entirely and make a hand-written matches function like this:
fn matches(text: String) -> Bool:
    var total_consumed = 0
    var pos = 0
    var text_len = len(text)
    var word_matches = 0
    var word_total_consumed = 0
    while True:
        if pos >= text_len:
            break
        var ch = text[pos]
        alias char_class = "word"
        var char_matches = False
        char_matches = (
            (ord("a") <= ord(ch) <= ord("z"))
            or (ord("A") <= ord(ch) <= ord("Z"))
            or (ord("0") <= ord(ch) <= ord("9"))
            or ch == "_"
        )
        if not char_matches:
            break
        var chars_consumed = 1
        if chars_consumed == 0:
            break
        word_matches += 1
        word_total_consumed += chars_consumed
        pos += chars_consumed
    if word_matches < 1:
        return False
    total_consumed += word_total_consumed
    ... # much more code, for the rest of the pattern
This is about 10.5x faster in my benchmarks. [2]
It makes you wonder: can an optimizer somehow do this for us?
Unfortunately, no. There are a lot of challenges in the way, as I'll explain below.
[0]: ...which is a TERRIBLE idea and you should never do it (see this page), but what's life without a little regex-related sin!
[1]: Real regex engines typically parse patterns into ASTs first, then compile them to state machines (DFAs/NFAs) that process input with state transitions. This toy implementation uses a recursive AST interpreter instead, which has function-call overhead at every node, but it much better illustrates the power of this technique: all of that overhead is eliminated.
[2]: Tested on an M2 MacBook Pro, run 200,000,000 times; more details below.
The problem is that matches is kind of like an interpreter. It takes in a constant tree of nodes, like this:

[diagram: the regex parsed into a tree of nodes]
...and it also takes in the user's string which is only known at run-time.
It then navigates up and down the tree according to what character is next in the user's string.
To illustrate, here's the top-level matches function:
@no_inline
fn matches(regex: Regex, text: String) -> Bool:
    var result = _match_node(regex.nodes, regex.root_idx, text, 0)
    return result.matched and result.chars_consumed == len(text)
...which really just calls into the (recursive) _match_node function:
fn _match_node(
    nodes: List[RegexNode], node_idx: Int, text: String, start_pos: Int
) -> MatchResult:
    var node = nodes[node_idx]
    if node.isa[LiteralNode]():
        ref literal_node = node[LiteralNode]
        return _match_literal(nodes, literal_node, text, start_pos)
    elif node.isa[RepeatNode]():
        ref repeat_node = node[RepeatNode]
        return _match_repeat(nodes, repeat_node, text, start_pos)
    elif ...
...which calls various functions, like the below _match_repeat function. This function checks the repeating parts of the regular expression, like how the email regex's \w+ uses + to repeat the \w.
fn _match_repeat(
    nodes: List[RegexNode],  # All the regex nodes
    node: RepeatNode,        # Which node we're currently on
    text: String,            # User's string
    start_pos: Int           # Where in the user's string we're at
) -> MatchResult:
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        if node.maximum_times >= 0 and matches >= node.maximum_times:
            break
        var result = _match_node(nodes, node.repeated, text, pos)
        if not result.matched:
            break
        if result.chars_consumed == 0:
            break
        matches += 1
        total_consumed += result.chars_consumed
        pos += result.chars_consumed
    if matches >= node.minimum_times:
        return MatchResult(True, total_consumed)
    else:
        return MatchResult(False, 0)
One benefit of the recursive approach is that it's very flexible: it can execute an arbitrary tree of nodes, and therefore an arbitrary regular expression. The hand-written version doesn't have that flexibility; its behavior is fixed up-front.
But of course, the flexibility has a cost: recursion requires more function calls than the hand-written version, and function calls have overhead. Every time we call a function, that's pushing and popping on the stack. The hand-written version is faster because it doesn't have that problem.
Another problem is that this general regex code has extra flexibility that the hand-written one doesn't need. Look at this condition from above:
if node.maximum_times >= 0 and matches >= node.maximum_times:
    break
This code is only relevant when we want a maximum number of matches, as if we said \w{,7} to mean maximum 7 matches. But our particular email regex has things like \w+ with no upper limit, to match any number of \w characters. So it's unfortunate that this useless code is in here, slowing things down. The hand-written version is faster because it doesn't have that problem.
The hand-written version is also theoretically more optimizable because everything is in one function, as if we inlined all these recursive _match_node, _match_repeat, etc calls. Optimizers generally do better when everything is in one function.
Is there a way to get these benefits for our general Regex code?
[3]: I had to hold myself back from adding even more interesting features to it, like SIMD. This implementation also doesn't support backtracking, which real regex engines need for more complex patterns.
Above, I said that the hand-written function is almost as if we inlined all these recursive _match_node calls.
So the obvious question: can we inline _match_node?
Let's try it by throwing an @always_inline on it:
@always_inline
fn _match_node(
    nodes: List[RegexNode], node_idx: Int, text: String, start_pos: Int
) -> MatchResult:
    ...
But of course, trying to inline a recursive call causes a compile error:
main_rt_inlined.mojo:19:4: error: function has recursive call to 'always_inline' function
fn _match_node(
^
main_rt_inlined.mojo:28:29: note: through call here
return _match_repeat(nodes, repeat_node, text, start_pos)
^
main_rt_inlined.mojo:119:4: note: to function marked 'always_inline' here
fn _match_repeat(
^
main_rt_inlined.mojo:128:33: note: call here recurses
var result = _match_node(nodes, node.repeated, text, pos)
^
main_rt_inlined.mojo:19:4: note: back to function here
fn _match_node(
^
So let's table this line of thinking for now. We might come back to it later.
I promised something mind-blowingly weird, so let's make things weird.
We're going to do some multi-stage programming to get similar effects to the first Futamura projection [4], or as it's known to some, "the three steps of Futamura sorcery". [5]
We'll:

1. Parse the regular expression at compile time.
2. Read the parsed nodes at compile time too, making every decision about the regex before the program even runs.
3. Inline all the resulting functions.

...and then something amazing happens.
[4]: Specifically, we're specializing an interpreter for a specific program, though we'll use a few explicit staging annotations (alias and square brackets) rather than automatic partial evaluation.
[5]: By "some", I mean like three people. Two if you don't count me.
Most Regex implementations first parse the regular expression into a list of nodes. If we could move that parsing to compile time, we could save some run-time work.
To parse the Regex into nodes at compile time, we simply need to change the var regex to alias regex.
Before:
# Parse the regex (at runtime)
var regex = Regex("\w+(\+\w*)?@(\d+\.\d+\.\d+\.\d+|\w+\.\w+)")
_match_node(regex.nodes, regex.root_idx, user_string, 0)
After:
# Parse the regex (at compile time)
alias regex = Regex("\w+(\+\w*)?@(\d+\.\d+\.\d+\.\d+|\w+\.\w+)")
_match_node[regex.nodes, regex.root_idx](user_string, 0)
alias is Mojo-speak for "compile-time variable". [6]
"But wait Evan, that's a lot of code to run at compile time, can compilers really execute general code at compile-time?"
They can! This is one of the nicer features of Mojo (and a few other languages like D, Nim, Zig, etc.). A simple keyword (alias) can lift almost any computation [7] to compile time, and the same code can be run at compile-time or run-time depending on how it's used (like var vs alias).
In this case, the compiler is running a whole little parser, with its own recursion, heap allocation, all sorts of stuff. And now it's happening at compile-time.
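To make that concrete, here's a tiny sketch of my own (fib is a stand-in for any pure function, not something from the regex code). The same function runs in the compiler or on the CPU depending on whether its result is bound with alias or var:

fn fib(n: Int) -> Int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

fn main():
    alias a = fib(20)  # evaluated inside the compiler; 'a' is baked in as 6765
    var b = fib(20)    # evaluated at run-time
    print(a, b)        # prints: 6765 6765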
"Wait Evan, did you say heap allocation? Isn't heap allocation at compile-time impossible?"
It's rare, but not impossible! Mojo, D, Nim, and Zig can do it, and C++ can as of C++20. There may be other languages that can do it too, but these are the only ones I know of that can truly run normal run-time code at compile time, as opposed to requiring compile-time calculations to be expressed in a different dialect. Though, I'd bet that some Lisps were doing all this back in the 60s and 70s too, as is usually the case!
Deep inside the Mojo compiler, there's a compile-time interpreter that works closely with the instantiator (the "elaborator", which monomorphizes all generics). It runs all expressions inside generic arguments ([...]) and inside alias statements, and it can execute pretty much any operation that a normal CPU could execute, including malloc and free.
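Here's a minimal sketch of mine showing what that enables (build_squares is a made-up function, and this assumes, as the article's alias regex does, that a heap-allocating List can live in an alias):

fn build_squares() -> List[Int]:
    var squares = List[Int]()  # heap allocation...
    for i in range(5):
        squares.append(i * i)
    return squares

fn main():
    alias squares = build_squares()  # ...executed entirely by the compile-time interpreter
    print(squares[3])  # prints 9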
[6]: I kind of wish Mojo had a comptime keyword like Zig, but alas!
[7]: I say almost because there are things a language shouldn't allow at compile time, like reading from stdin.
We just created this Regex at compile time.
Do we read it at run-time? Do we put it into a global or something?
No! We're going to read it only at compile-time.
"But Evan, we need to read the Regex at run-time! Otherwise, how does the run-time program even know what to do? How does it know what functions to call, what branches to take, etc.?"
Great question! The answer is that we can make all those decisions at compile time.
I know that doesn't make any sense, but bear with me.
Here's how we change our program.
Before:
fn _match_node(
    nodes: List[RegexNode],
    node_idx: Int,
    text: String,
    start_pos: Int
) -> MatchResult:
    var node = nodes[node_idx]
    if node.isa[LiteralNode]():
        ref literal_node = node[LiteralNode]
        return _match_literal(nodes, literal_node, text, start_pos)
    elif node.isa[RepeatNode]():
        ref repeat_node = node[RepeatNode]
        return _match_repeat(nodes, repeat_node, text, start_pos)
    elif ...
After:
fn _match_node[nodes: List[RegexNode], node_idx: Int](
    text: String,
    start_pos: Int
) -> MatchResult:
    alias node = nodes[node_idx]

    @parameter
    if node.isa[LiteralNode]():
        alias literal_node = node[LiteralNode]
        return _match_literal[nodes, literal_node](text, start_pos)
    elif node.isa[RepeatNode]():
        alias repeat_node = node[RepeatNode]
        return _match_repeat[nodes, repeat_node](text, start_pos)
    elif ...
There are two main changes:

- nodes and node_idx moved out of the run-time arguments and into the compile-time parameters, in square brackets.
- var node became alias node, and the if became a @parameter if, both of which are evaluated at compile time.

That means _match_node is now a generic function, a template.
There's only one nodes value (from the Regex), but there are 28 different node_idxs for this Regex. That means we'll have 28 different instantiations of this _match_node, one for each node_idx.
Previously, we had a very simple call tree, with only one _match_node, which was recursive. The call tree looked like this:

[diagram: _match_node calling itself recursively]

Now, we have 28 versions of _match_node, with a call tree that looks like this:

[diagram: 28 distinct _match_node[...] instantiations calling each other]
28 versions, each different only in their node_idx compile-time parameter.
To understand that better, let's look at the _match_node[List(...), 1] function for the first \w+.
When the compiler instantiates it as _match_node[List(...), 1], it looks like the below. The @parameter if is evaluated at compile time, so only one of its branches is included. I'll leave the others in as comments to illustrate:
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    # alias node = nodes[node_idx]
    # @parameter
    # if node.isa[LiteralNode]():
    #     alias literal_node = node[LiteralNode]
    #     return _match_literal[nodes, literal_node](text, start_pos)
    # elif node.isa[RepeatNode]():
    #     alias repeat_node = node[RepeatNode]
    return _match_repeat[List(...), RepeatNode(2, 1, -1)](text, start_pos)
    # elif ...
The benefit here is that we don't have to do the node.isa[LiteralNode], node.isa[RepeatNode], etc. checks at run-time. That's pretty nice; it eliminates a bit of overhead.
The compiler also executed the alias statements for us, so we don't have to do those lookups (and bounds checks) at run-time. Another nice speedup!
We'll see the biggest benefit in the next section though.
Let's look at the call tree again:

[diagram: the 28 specialized functions calling each other, with no cycles]
Note how these are not actually recursive calls anymore. The compiler doesn't see these as recursive; it sees them as completely different functions, because they are.
As far as the compiler is concerned, this is a normal call tree, not a potentially infinite call tree.
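Here's a minimal standalone sketch of mine showing why (countdown is a made-up function, not part of the regex code). Each call names a different instantiation, and the @parameter if ends the chain at compile time, so no instantiation ever calls itself:

fn countdown[n: Int]():
    print(n)

    @parameter
    if n > 0:
        countdown[n - 1]()  # calls a *different* function: countdown[n - 1]

fn main():
    countdown[3]()  # instantiates countdown[3], countdown[2], countdown[1], countdown[0]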
So... maybe we can inline them? Can we put @always_inline on these functions?
Yes we can!
Let's start by inlining the _match_repeat function. We do that by putting @always_inline on fn _match_repeat:
@always_inline
fn _match_repeat[
    nodes: List[RegexNode], repeat_node: RepeatNode
](text: String, start_pos: Int) -> MatchResult:
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        if repeat_node.maximum_times >= 0 and matches >= repeat_node.maximum_times:
            break
        var result = _match_node[nodes, repeat_node.repeated](text, pos)
        ...
And now it will be inlined into its caller.
Remember our _match_node[List(...), 1] function?
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    alias node = nodes[node_idx]
    alias repeat_node = node[RepeatNode]
    # Let's inline this _match_repeat call first!
    return _match_repeat[nodes, repeat_node](text, start_pos)
Now that _match_repeat is inlined, _match_node looks more like this:
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    alias node = nodes[node_idx]
    alias repeat_node = node[RepeatNode]
    # Inlined _match_repeat call
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        if repeat_node.maximum_times >= 0 and matches >= repeat_node.maximum_times:
            break
        var result = _match_node[nodes, repeat_node.repeated](text, pos)
        ...
    ...
And the compiler will also execute those alias statements at compile time and inline their results.
This part:

if repeat_node.maximum_times >= 0 and matches >= repeat_node.maximum_times:
    break

actually becomes:

if RepeatNode(2, 1, -1).maximum_times >= 0 and matches >= RepeatNode(2, 1, -1).maximum_times:
    break

which then becomes:

if -1 >= 0 and matches >= -1:
    break

which then becomes:

if False and matches >= -1:
    break

which is then completely eliminated, because the condition can never be true.
So in the end, our _match_node[List(...), 1] function is much simpler, and more specialized to this particular RepeatNode.
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    # No node lookup!
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        # No useless maximum_times checks here!
        var result = _match_node[List(...), 2](text, pos)  # repeated node index baked in
        ...
    ...
We got all that benefit from just one @always_inline on one function!
Let's throw @always_inline on all the functions:
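Sketched out, with bodies elided (these are the same signatures we've already seen above):

@always_inline
fn _match_node[nodes: List[RegexNode], node_idx: Int](
    text: String, start_pos: Int
) -> MatchResult:
    ...

@always_inline
fn _match_literal[nodes: List[RegexNode], literal_node: LiteralNode](
    text: String, start_pos: Int
) -> MatchResult:
    ...

@always_inline
fn _match_repeat[nodes: List[RegexNode], repeat_node: RepeatNode](
    text: String, start_pos: Int
) -> MatchResult:
    ...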
Suddenly, our large call tree above becomes simply this:

[diagram: a single, fully-inlined matches function]
And looking at the generated IR, it was all folded down into one function, structured similarly to the hand-written version.
The final results (tested on an M2 MacBook Pro, run 200,000,000 times): the metaprogrammed version is 10x faster than the normal recursive version, and within ~3% of the hand-written version.
Pretty impressive!
Could this go even faster than the hand-written version? It could!
Regex libraries often do their own custom optimizations on the AST, before matching against the user's strings.
For example, the pattern (catastrophe|catapult|category) shares the prefix 'cat'. At compile-time, we could detect this and generate code that's more like cat(astrophe|apult|egory), saving some comparisons.
Or, for patterns like \b(GET|POST|PUT|DELETE)\b, it could figure out (at compile-time) a perfect hash table or trie, replacing 4 string comparisons with one hash lookup / string compare.
These are techniques you rarely see in hand-written versions, but they're common in normal regex engines.
If we combine those with this technique, we could get those optimizations on top of hand-written performance.
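To make the prefix idea concrete, here's a hedged sketch of mine (common_prefix is a made-up helper, and I'm assuming Python-style indexing and slicing on String). Because the call is bound with alias, the whole loop runs inside the compiler:

fn common_prefix(a: String, b: String) -> String:
    # Count how many leading characters the two alternatives share.
    var i = 0
    var n = min(len(a), len(b))
    while i < n and a[i] == b[i]:
        i += 1
    return String(a[0:i])

fn main():
    alias prefix = common_prefix("catastrophe", "catapult")
    print(prefix)  # prints "cat"

With real alternatives you'd fold this over all of them, then emit a specialized matcher for the shared prefix followed by the shortened alternatives.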
Let's talk about the main downside: compile times!
This takes much more time to compile, because the compiler is generating much more code. In this example, we made 28 _match_node[...] functions, each about a fifth of the size of the original run-time recursive _match_node. So this technique generated about 5x as much _match_node code.
And if you do custom optimizations like I mentioned in the last section, that also gets more performance at the cost of more compile time.
One should use this technique wisely, only on code that needs to run fast.
I should mention, I suspect this drawback can be mitigated or avoided entirely. I think a user could theoretically use MaybeComptime and a few accompanying tools to have a dev/test mode where these calculations happen at run-time instead. Not proven, but my spidey sense says it's possible.
This is metaprogramming. It's not just templates and constexpr. It's moving entire aspects of your program into compile-time, and doing custom optimizations so that it can work even better than hand-written code.
Metaprogramming is a term I've heard for many, many years, but I admit, I didn't truly understand its power and implications until seeing what it could do here.
This regex example is just something most of us can relate to, but I think this technique can be used for much, much more.
This is just wild speculation, but I think you could use this for:
If I were to generalize:

There are a few other directions I want to explore here:
If you have any thoughts on where this technique can go, let me know!
It turns out, the impossible optimization is totally possible!
All it took was a little metaprogramming sorcery.
If you want to see the code, you can find it here.
If you want to read more on this technique, search for the first Futamura projection, or "staged programming", which is a similar concept.
I hope you enjoyed this! Stay tuned for more articles by subscribing to the RSS feed or twitter, and come hang out in the Mojo discord, or find me on the Vale discord.
Cheers!
- Evan