Every once in a while, you come across an optimization so mind-blowingly weird that you feel like you're peering into some sort of alternate dimension.
Until recently, I believed that metaprogramming was basically just normal C++ templates with a few neat features like constexpr thrown on top. I thought it was just "writing code that writes code", like we've been doing with PHP since the dawn of time. Simple stuff, right?
If you believe that like I did, buckle up! I'm going to show you how metaprogramming can be used to unlock what I call the impossible optimization.
And I don't just mean little 5% optimizations. I'm talking about crazy optimizations that make your code ten times faster.
Imagine that you had a program using a simple regex function, and you wanted to use it to verify emails [0]:
fn main():
    var email_regex = Regex(
        "\w+(\+\w*)?@(\d+\.\d+\.\d+\.\d+|\w+\.\w+)"
    )
    print(matches(email_regex, "user@example.com"))  # Prints True
    print(matches(email_regex, "uexample.com"))  # Prints False
    print(matches(email_regex, "user@ecom"))  # Prints False
    print(matches(email_regex, "user+tag@example.com"))  # Prints True
    print(matches(email_regex, "user@100"))  # Prints False
    print(matches(email_regex, "howdy123@1.2.3.4"))  # Prints True
    print(matches(email_regex, "howdy1231.2.3.4"))  # Prints False
    print(matches(email_regex, "howdy123@1/2/3/4"))  # Prints False
We all know that regex is pretty slow, because it basically has to run a whole interpreter. Or, if we "precompile" a Regex, it would be interpreting an AST (or a DFA), which is still pretty slow. [1]
When you really need speed, you skip regex entirely and make a hand-written matches function like this:
fn matches(text: String) -> Bool:
    var total_consumed = 0
    var pos = 0
    var text_len = len(text)
    var word_matches = 0
    var word_total_consumed = 0
    while True:
        if pos >= text_len:
            break
        var ch = text[pos]
        alias char_class = "word"
        var char_matches = False
        char_matches = (
            (ord("a") <= ord(ch) <= ord("z"))
            or (ord("A") <= ord(ch) <= ord("Z"))
            or (ord("0") <= ord(ch) <= ord("9"))
            or ch == "_"
        )
        if not char_matches:
            break
        var chars_consumed = 1
        if chars_consumed == 0:
            break
        word_matches += 1
        word_total_consumed += chars_consumed
        pos += chars_consumed
    if word_matches < 1:
        return False
    total_consumed += word_total_consumed
    ... # much more code, for the rest of the pattern
This is about 10.5x faster in my benchmarks. [2]
It makes you wonder: can an optimizer somehow do this for us?
Unfortunately, no. There are a lot of challenges in the way, as I'll explain below.
[0]: ...which is a TERRIBLE idea and you should never do it (see this page), but what's life without a little regex-related sin!
[1]: Real regex engines typically parse patterns into ASTs first, then compile them to state machines (DFAs/NFAs) that process input with state transitions. This toy implementation uses a recursive AST interpreter instead, which has function-call overhead at every node, but it much better illustrates the power of this technique: all of that overhead is eliminated.
[2]: Tested on an M2 MacBook Pro, run 200,000,000 times; more details below.
The problem is that matches is kind of like an interpreter. It takes in a constant tree of nodes, like this:

[diagram: the regex parsed into a tree of nodes]
...and it also takes in the user's string which is only known at run-time.
It then navigates up and down the tree according to what character is next in the user's string.
To illustrate, here's the top-level matches function:
@no_inline
fn matches(regex: Regex, text: String) -> Bool:
    var result = _match_node(regex.nodes, regex.root_idx, text, 0)
    return result.matched and result.chars_consumed == len(text)
...which really just calls into the (recursive) _match_node function:
fn _match_node(
    nodes: List[RegexNode], node_idx: Int, text: String, start_pos: Int
) -> MatchResult:
    var node = nodes[node_idx]
    if node.isa[LiteralNode]():
        ref literal_node = node[LiteralNode]
        return _match_literal(nodes, literal_node, text, start_pos)
    elif node.isa[RepeatNode]():
        ref repeat_node = node[RepeatNode]
        return _match_repeat(nodes, repeat_node, text, start_pos)
    elif ...
...which calls various functions, like the below _match_repeat function. This function checks the repeating parts of the regular expression, like how the email regex's \w+ uses + to repeat the \w.
fn _match_repeat(
    nodes: List[RegexNode],  # All the regex nodes
    node: RepeatNode,        # Which node we're currently on
    text: String,            # User's string
    start_pos: Int           # Where in the user's string we're at
) -> MatchResult:
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        if node.maximum_times >= 0 and matches >= node.maximum_times:
            break
        var result = _match_node(nodes, node.repeated, text, pos)
        if not result.matched:
            break
        if result.chars_consumed == 0:
            break
        matches += 1
        total_consumed += result.chars_consumed
        pos += result.chars_consumed
    if matches >= node.minimum_times:
        return MatchResult(True, total_consumed)
    else:
        return MatchResult(False, 0)
One benefit of the recursive approach is that it's very flexible: it can execute an arbitrary tree of nodes, and therefore an arbitrary regular expression. The hand-written version doesn't have that flexibility; its behavior is fixed up-front.
But of course, the flexibility has a cost: recursion requires more function calls than the hand-written version, and function calls have overhead. Every time we call a function, that's pushing and popping on the stack. The hand-written version is faster because it doesn't have that problem.
Another problem is that this general regex code has extra flexibility that the hand-written one doesn't need. Look at this condition from above:
if node.maximum_times >= 0 and matches >= node.maximum_times:
    break
This code is only relevant when we want a maximum number of matches, as if we said \w{,7} to mean maximum 7 matches. But our particular email regex has things like \w+ with no upper limit, to match any number of \w characters. So it's unfortunate that this useless code is in here, slowing things down. The hand-written version is faster because it doesn't have that problem.
The hand-written version is also theoretically more optimizable because everything is in one function, as if we inlined all these recursive _match_node, _match_repeat, etc calls. Optimizers generally do better when everything is in one function.
Is there a way to get these benefits for our general Regex code?
[3]: I had to hold myself back from adding even more interesting features to it, like SIMD. This implementation also doesn't support backtracking, which real regex engines need for more complex patterns.
Above, I said that the hand-written function is almost as if we inlined all these recursive _match_node calls.
So the obvious question: can we inline _match_node?
Let's try it by throwing an @always_inline on it:
@always_inline
fn _match_node(
    nodes: List[RegexNode], node_idx: Int, text: String, start_pos: Int
) -> MatchResult:
    ...
But of course, trying to inline a recursive call causes a compile error:
main_rt_inlined.mojo:19:4: error: function has recursive call to 'always_inline' function
fn _match_node(
^
main_rt_inlined.mojo:28:29: note: through call here
return _match_repeat(nodes, repeat_node, text, start_pos)
^
main_rt_inlined.mojo:119:4: note: to function marked 'always_inline' here
fn _match_repeat(
^
main_rt_inlined.mojo:128:33: note: call here recurses
var result = _match_node(nodes, node.repeated, text, pos)
^
main_rt_inlined.mojo:19:4: note: back to function here
fn _match_node(
^
So let's table this line of thinking for now. We might come back to it later.
I promised something mind-blowingly weird, so let's make things weird.
We're going to do some multi-stage programming to get similar effects to the first Futamura projection [4], or as it's known to some, "the three steps of Futamura sorcery". [5]
We'll:

1. Parse the regular expression at compile time.
2. Read the parsed nodes at compile time too, making every decision about the regex before the program even runs.
3. Inline all the resulting functions.

...and then something amazing happens.
[4]: Specifically, we're specializing an interpreter for a specific program, though we'll use a few explicit staging annotations (alias and square brackets) rather than automatic partial evaluation.
[5]: By "some", I mean like three people. Two if you don't count me.
Most Regex implementations first parse the regular expression into a list of nodes. If we could move that parsing to compile time, we could save some run-time work.
To parse the Regex into nodes at compile time, we simply need to change the var regex to alias regex.
Before:
# Parse the regex (at runtime)
var regex = Regex("\w+(\+\w*)?@(\d+\.\d+\.\d+\.\d+|\w+\.\w+)")
_match_node(regex.nodes, regex.root_idx, user_string, 0)
After:
# Parse the regex (at compile time)
alias regex = Regex("\w+(\+\w*)?@(\d+\.\d+\.\d+\.\d+|\w+\.\w+)")
_match_node[regex.nodes, regex.root_idx](user_string, 0)
alias is Mojo-speak for "compile-time variable". [6]
"But wait Evan, that's a lot of code to run at compile time, can compilers really execute general code at compile-time?"
They can! This is one of the nicer features of Mojo (and a few other languages like D, Nim, Zig, etc.). A simple keyword (alias) can lift almost any computation [7] to compile time, and the same code can be run at compile-time or run-time depending on how it's used (like var vs alias).
In this case, the compiler is running a whole little parser, with its own recursion, heap allocation, all sorts of stuff. And now it's happening at compile-time.
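To make that concrete, here's a tiny sketch of my own (fib is a stand-in for any pure function, not something from the regex code). The same function runs in the compiler or on the CPU depending on whether its result is bound with alias or var:

fn fib(n: Int) -> Int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

fn main():
    alias a = fib(20)  # evaluated inside the compiler; 'a' is baked in as 6765
    var b = fib(20)    # evaluated at run-time
    print(a, b)        # prints: 6765 6765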
"Wait Evan, did you say heap allocation? Isn't heap allocation at compile-time impossible?"
It's rare, but not impossible! Mojo, D, Nim, and Zig can do it, and C++ can as of C++20. There may be other languages that can do it too, but these are the only ones I know of that can truly run normal run-time code at compile time, as opposed to requiring compile-time calculations to be expressed in a different dialect. Though, I'd bet that some Lisps were doing all this back in the 60s and 70s too, as is usually the case!
Deep inside the Mojo compiler, there's a compile-time interpreter that works closely with the instantiator (the "elaborator", which monomorphizes all generics). It runs all expressions inside generic arguments ([...]) and inside alias statements, and it can execute pretty much any operation that a normal CPU could execute, including malloc and free.
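Here's a minimal sketch of mine showing what that enables (build_squares is a made-up function, and this assumes, as the article's alias regex does, that a heap-allocating List can live in an alias):

fn build_squares() -> List[Int]:
    var squares = List[Int]()  # heap allocation...
    for i in range(5):
        squares.append(i * i)
    return squares

fn main():
    alias squares = build_squares()  # ...executed entirely by the compile-time interpreter
    print(squares[3])  # prints 9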
[6]: I kind of wish Mojo had a comptime keyword like Zig, but alas!
[7]: I say almost because there are things a language shouldn't allow at compile time, like reading from stdin.
We just created this Regex at compile time.
Do we read it at run-time? Do we put it into a global or something?
No! We're going to read it only at compile-time.
"But Evan, we need to read the Regex at run-time! Otherwise, how does the run-time program even know what to do? How does it know what functions to call, what branches to take, etc.?"
Great question! The answer is that we can make all those decisions at compile time.
I know that doesn't make any sense, but bear with me.
Here's how we change our program.
Before:
fn _match_node(
    nodes: List[RegexNode],
    node_idx: Int,
    text: String,
    start_pos: Int
) -> MatchResult:
    var node = nodes[node_idx]
    if node.isa[LiteralNode]():
        ref literal_node = node[LiteralNode]
        return _match_literal(nodes, literal_node, text, start_pos)
    elif node.isa[RepeatNode]():
        ref repeat_node = node[RepeatNode]
        return _match_repeat(nodes, repeat_node, text, start_pos)
    elif ...
After:
fn _match_node[nodes: List[RegexNode], node_idx: Int](
    text: String,
    start_pos: Int
) -> MatchResult:
    alias node = nodes[node_idx]

    @parameter
    if node.isa[LiteralNode]():
        alias literal_node = node[LiteralNode]
        return _match_literal[nodes, literal_node](text, start_pos)
    elif node.isa[RepeatNode]():
        alias repeat_node = node[RepeatNode]
        return _match_repeat[nodes, repeat_node](text, start_pos)
    elif ...
There are two main changes:

- nodes and node_idx moved out of the run-time arguments and into the compile-time parameters, in square brackets.
- var node became alias node, and the if became a @parameter if, both of which are evaluated at compile time.

That means _match_node is now a generic function, a template.
There's only one nodes value (from the Regex), but there are 28 different node_idxs for this Regex. That means we'll have 28 different instantiations of this _match_node, one for each node_idx.
Previously, we had a very simple call tree, with only one _match_node, which was recursive. The call tree looked like this:

[diagram: _match_node calling itself recursively]

Now, we have 28 versions of _match_node, with a call tree that looks like this:

[diagram: 28 distinct _match_node[...] instantiations calling each other]
28 versions, each different only in their node_idx compile-time parameter.
To understand that better, let's look at the _match_node[List(...), 1] function for the first \w+.
When the compiler instantiates it as _match_node[List(...), 1], it looks like the below. The @parameter if is evaluated at compile time, so only one of its branches is included. I'll leave the others in as comments to illustrate:
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    # alias node = nodes[node_idx]
    # @parameter
    # if node.isa[LiteralNode]():
    #     alias literal_node = node[LiteralNode]
    #     return _match_literal[nodes, literal_node](text, start_pos)
    # elif node.isa[RepeatNode]():
    #     alias repeat_node = node[RepeatNode]
    return _match_repeat[List(...), RepeatNode(2, 1, -1)](text, start_pos)
    # elif ...
The benefit here is that we don't have to do the node.isa[LiteralNode], node.isa[RepeatNode], etc. checks at run-time. That's pretty nice; it eliminates a bit of overhead.
The compiler also executed the alias statements for us, so we don't have to do those lookups (and bounds checks) at run-time. Another nice speedup!
We'll see the biggest benefit in the next section though.
Let's look at the call tree again:

[diagram: the 28 specialized functions calling each other, with no cycles]
Note how these are not actually recursive calls anymore. The compiler doesn't see these as recursive; it sees them as completely different functions, because they are.
As far as the compiler is concerned, this is a normal call tree, not a potentially infinite call tree.
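Here's a minimal standalone sketch of mine showing why (countdown is a made-up function, not part of the regex code). Each call names a different instantiation, and the @parameter if ends the chain at compile time, so no instantiation ever calls itself:

fn countdown[n: Int]():
    print(n)

    @parameter
    if n > 0:
        countdown[n - 1]()  # calls a *different* function: countdown[n - 1]

fn main():
    countdown[3]()  # instantiates countdown[3], countdown[2], countdown[1], countdown[0]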
So... maybe we can inline them? Can we put @always_inline on these functions?
Yes we can!
Let's start by inlining the _match_repeat function. We do that by putting @always_inline on fn _match_repeat:
@always_inline
fn _match_repeat[
    nodes: List[RegexNode], repeat_node: RepeatNode
](text: String, start_pos: Int) -> MatchResult:
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        if repeat_node.maximum_times >= 0 and matches >= repeat_node.maximum_times:
            break
        var result = _match_node[nodes, repeat_node.repeated](text, pos)
        ...
And now it will be inlined into its caller.
Remember our _match_node[List(...), 1] function?
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    alias node = nodes[node_idx]
    alias repeat_node = node[RepeatNode]
    # Let's inline this _match_repeat call first!
    return _match_repeat[nodes, repeat_node](text, start_pos)
Now that _match_repeat is inlined, _match_node looks more like this:
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    alias node = nodes[node_idx]
    alias repeat_node = node[RepeatNode]
    # Inlined _match_repeat call
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        if repeat_node.maximum_times >= 0 and matches >= repeat_node.maximum_times:
            break
        var result = _match_node[nodes, repeat_node.repeated](text, pos)
        ...
    ...
And the compiler will also execute those alias statements at compile time and inline their results.
This part:

if repeat_node.maximum_times >= 0 and matches >= repeat_node.maximum_times:
    break

actually becomes:

if RepeatNode(2, 1, -1).maximum_times >= 0 and matches >= RepeatNode(2, 1, -1).maximum_times:
    break

which then becomes:

if -1 >= 0 and matches >= -1:
    break

which then becomes:

if False and matches >= -1:
    break

which is then completely eliminated, because the condition can never be true.
So in the end, our _match_node[List(...), 1] function is much simpler, and more specialized to this particular RepeatNode.
fn _match_node[List(...), 1](  # Node 1 is a RepeatNode
    text: String,
    start_pos: Int
) -> MatchResult:
    # No node lookup!
    var matches = 0
    var total_consumed = 0
    var pos = start_pos
    while True:
        # No useless maximum_times checks here!
        var result = _match_node[List(...), 2](text, pos)  # repeated node index baked in
        ...
    ...
We got all that benefit from just one @always_inline on one function!
Let's throw @always_inline on all the functions:
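Sketched out, with bodies elided (these are the same signatures we've already seen above):

@always_inline
fn _match_node[nodes: List[RegexNode], node_idx: Int](
    text: String, start_pos: Int
) -> MatchResult:
    ...

@always_inline
fn _match_literal[nodes: List[RegexNode], literal_node: LiteralNode](
    text: String, start_pos: Int
) -> MatchResult:
    ...

@always_inline
fn _match_repeat[nodes: List[RegexNode], repeat_node: RepeatNode](
    text: String, start_pos: Int
) -> MatchResult:
    ...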
Suddenly, our large call tree above becomes simply this:

[diagram: a single, fully-inlined matches function]
And looking at the generated IR, it was all folded down into one function, structured similarly to the hand-written version.
The final results (tested on an M2 MacBook Pro, run 200,000,000 times): the metaprogrammed version is 10x faster than the normal recursive version, and within ~3% of the hand-written version.
Pretty impressive!
Could this go even faster than the hand-written version? It could!
Regex libraries often do their own custom optimizations on the AST, before matching against the user's strings.
For example, the pattern (catastrophe|catapult|category) shares the prefix 'cat'. At compile-time, we could detect this and generate code that's more like cat(astrophe|apult|egory), saving some comparisons.
Or, for patterns like \b(GET|POST|PUT|DELETE)\b, it could figure out (at compile-time) a perfect hash table or trie, replacing 4 string comparisons with one hash lookup / string compare.
These are techniques you rarely see in hand-written versions, but they're common in normal regex engines.
If we combine those with this technique, we could get those optimizations on top of hand-written performance.
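To make the prefix idea concrete, here's a hedged sketch of mine (common_prefix is a made-up helper, and I'm assuming Python-style indexing and slicing on String). Because the call is bound with alias, the whole loop runs inside the compiler:

fn common_prefix(a: String, b: String) -> String:
    # Count how many leading characters the two alternatives share.
    var i = 0
    var n = min(len(a), len(b))
    while i < n and a[i] == b[i]:
        i += 1
    return String(a[0:i])

fn main():
    alias prefix = common_prefix("catastrophe", "catapult")
    print(prefix)  # prints "cat"

With real alternatives you'd fold this over all of them, then emit a specialized matcher for the shared prefix followed by the shortened alternatives.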
Let's talk about the main downside: compile times!
This takes much more time to compile, because the compiler is generating much more code. In this example, we made 28 _match_node[...] functions, each about a fifth of the size of the original run-time recursive _match_node. So this technique generated about 5x as much _match_node code.
And if you do custom optimizations like I mentioned in the last section, that also gets more performance at the cost of more compile time.
One should use this technique wisely, only on code that needs to run fast.
I should mention, I suspect this drawback can be mitigated or avoided entirely. I think a user could theoretically use MaybeComptime and a few accompanying tools to have a dev/test mode where these calculations happen at run-time instead. Not proven, but my spidey sense says it's possible.
This is metaprogramming. It's not just templates and constexpr. It's moving entire aspects of your program into compile-time, and doing custom optimizations so that it can work even better than hand-written code.
Metaprogramming is a term I've heard for many, many years, but I admit, I didn't truly understand its power and implications until seeing what it could do here.
This regex example is just something most of us can relate to, but I think this technique can be used for much, much more.
This is just wild speculation, but I think you could use this for:
If I were to generalize:

There are a few other directions I want to explore here:
If you have any thoughts on where this technique can go, let me know!
It turns out, the impossible optimization is totally possible!
All it took was a little metaprogramming sorcery.
If you want to see the code, you can find it here.
If you want to read more on this technique, search for the first Futamura projection, or "staged programming", which is a similar concept.
I hope you enjoyed this! Stay tuned for more articles by subscribing to the RSS feed or twitter, and come hang out in the Mojo discord, or find me on the Vale discord.
Cheers!
- Evan