Currently, the general consensus is that you can't really run larger LLMs on ordinary computers. 0
But now, you can run an LLM much larger than your VRAM to answer a yes/no question every 5 seconds on average, for throughput use cases. 1
This is possible with batching and layer-wise inferencing from disk, where we stream an LLM's layers from the hard drive into VRAM, and run each layer against multiple in-progress prompts before moving on to the next layer.
This post is a bit AI-heavy, but as always, my trusty side notes will be explaining the more arcane concepts.
For fun, I spent an evening prototyping this into an AirLLM fork. 2
And it turns out, this isn't a new technique! Moritz Thüning used this technique months ago in their fltr tool, which is like grep but for natural language questions: ask it a human question and it will run that question against many documents at once. And even before that, Sheng et al. wrote a paper about the technique and published their code as FlexGen, though unfortunately for me, it doesn't support my Mac.
Read on to hear the nonsense that led to this, or feel free to skip to the technical stuff!
In 1985, an unknown man threw a fish from the beaches of Florida all the way into Alabama. To some, this became known as "the greatest fish-based artillery assault of the 80s." 3 To the Alabamans, it probably became known as the "fish heard 'round the world", 4 and confirmation of Florida's ambitions of conquest.
Now, every April, Floridians and Alabamans flock to the Perdido Key beaches to reenact that glorious toss, in a charity event where they compete to see who can throw a fish the furthest into Alabama.
It is unknown whether anyone has beaten the record set back then, and it is unknown whether this will escalate in coming years into more advanced munitions like rubber fish, which are more aerodynamic and pack a lot more kinetic energy.
When I heard about the Florabama Mullet Toss, I knew I had to find more weird events like this. Surely, there must be more!
This is actually something LLMs are really good at: crawling the web and interpreting webpages.
You generally either get an out-of-memory error or it runs way too slowly.
This means cases where latency doesn't matter, where you're fine waiting a while to get a result.
Just Mac currently, sorry!
"Some" being me and a few of my buddies from the discord server.
A reference to the shot heard round the world
So a couple months ago, I made a little tool that asks ChatGPT and then double-checks via Google, basically a poor man's retrieval-augmented generation approach:
It then takes the resulting list of event pages, and puts them on a webpage to ask me if they're weird enough and look legit.
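Roughly, the flow looked something like the sketch below. The prompt, the model name, and the find_event_pages helper are placeholders for illustration rather than the actual tool:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_chatgpt(question: str) -> list[str]:
    """Ask ChatGPT for candidate events, one per line."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": question + " List one event per line."}],
    )
    return response.choices[0].message.content.splitlines()

def find_event_pages(event: str) -> list[str]:
    """Placeholder: double-check via web search, return pages mentioning the event."""
    raise NotImplementedError("plug in your favorite search API here")

events = ask_chatgpt("What are the weirdest annual fish-related events in the United States?")
event_pages = {event: find_event_pages(event) for event in events}
# ...then render event_pages into a little review webpage.
```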
However, this was a bit expensive. Even with various optimizations, I was racking up $10 per night in OpenAI spending. And this was just a tiny side-project, so it would be hard to justify spending thousands of dollars on GPUs to run models locally.
It was unfortunate that my MacBook Pro only had 16gb of RAM, otherwise I could run an LLM locally. Alas, I would need much more RAM to run a 70B LLM, 7 and the smaller LLMs just don't work as well.
It's clear that I need more RAM so I can run an LLM locally.
...or do I?
First, some background on how LLMs work!
Or other fun ones like "What are the weirdest annual llama-related events in the United States?"
This part was fun to figure out: I had to print it to a PDF first then run a text extractor on it.
A Q5-quantized 70B model would need 60gb of RAM, and macOS needs a lot of RAM for itself, so we're talking a 64gb or 128gb MacBook Pro.
Basically, an LLM is divided into layers, like a cake. 8
For example, the 68-gigabyte model UNA-SimpleSmaug-34b-v1beta is actually 59 layers of 1.12 gigabytes each.
When an LLM does inferencing, 9 here's what's actually going on under the hood:
Or an onion, or an ogre.
"Inferencing" is predicting the next token in the LLM's response.
You can think of a "tensor" as a block of numbers, or rather, an N-dimensional array of numbers. An N=1 tensor is a regular array, an N=5 tensor is a 5d array, and an N=0 tensor is just a single number.
Here, each layer produces a block of numbers that is fed into the next layer. You can kind of think of it as an "in-progress thought".
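To make that concrete, here's a tiny sketch of that layer-by-layer flow in Python. The sizes and the layer math are stand-ins for illustration, not the real transformer internals:

```python
import numpy as np

NUM_LAYERS = 59    # e.g. UNA-SimpleSmaug-34b-v1beta's 59 layers
HIDDEN_SIZE = 16   # tiny toy size; the real "thought" tensor is much wider

# Stand-in weights: one small matrix per layer. In the real model, each
# layer is a full transformer block, roughly 1.12gb of weights.
layers = [np.random.randn(HIDDEN_SIZE, HIDDEN_SIZE) * 0.1 for _ in range(NUM_LAYERS)]

def run_layer(weights, thought):
    """Stand-in for the real transformer layer math."""
    return np.tanh(thought @ weights)

# The prompt gets embedded into an "in-progress thought" tensor...
thought = np.random.randn(HIDDEN_SIZE)

# ...which each layer refines in turn, before it's decoded into the next token.
for weights in layers:
    thought = run_layer(weights, thought)
```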
Recently, tools like lyogavin's AirLLM (and others) have emerged, which allow us to lazily load layers from disk into VRAM. 11
After all, why load the entire huge model into VRAM, when we're only using a piece of it at a time? Let's just load whatever piece we need, right as we need it!
You could almost think of it as streaming layers into VRAM on-demand, and then quickly unloading to make room for the next layer.
This technique lets us run LLMs on much smaller VRAM. We only need about 2 gigabytes of VRAM now, not 68. 12
VRAM is your GPU's memory. An LLM can only read data that's in your VRAM.
The layer needs 1.12gb, and we also need a little space for the "in-progress thought" tensor.
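Sketched in code, the streaming idea looks roughly like this, reusing the run_layer stand-in from the sketch above. The per-layer file layout here is hypothetical; AirLLM's real implementation handles the per-layer storage format and details:

```python
def load_layer(i):
    """Load just one layer's weights from disk (hypothetical per-layer file layout)."""
    return np.load(f"model_layers/layer_{i:02d}.npy")

def forward_from_disk(thought):
    # Only one layer's weights sit in memory at a time, instead of all 59.
    for i in range(NUM_LAYERS):
        weights = load_layer(i)                # stream this layer in from disk
        thought = run_layer(weights, thought)  # run it against our in-progress thought
        del weights                            # make room for the next layer
    return thought
```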
Layer-wise inferencing makes it possible to run large models, but not very desirable to do so.
If a "token" is a word, then the average human can speak 100-150 tokens per minute, and ChatGPT speaks about 6000 tokens per minute.
AirLLM gives us... 2 tokens per minute. Very slow indeed.
...which makes sense, really. It needs to load 59 layers from disk for every single token. 13
Even for my use case, that's a bit too slow. Even if I used AirLLM for only the yes-or-no questions (like "Does this event happen every year?"), it would only be able to double-check one possible event every 42 minutes. 14
This is roughly the consensus today: you can't really run large models on regular computers, it's just too slow.
You'd be slow too if you had to cycle through 59 layers every time you wanted to say a word.
30 seconds x 7 search results x 12 questions per result = 42 minutes.
The final approach would end up adding batching on top of this "layer-wise inferencing from disk" technique.
So what's batching?
"Batching" is a strategy where we do inferencing on multiple prompts at the same time.
For example, we can ask our LLM these 5 questions at the same time:
And it turns each of those into an in-progress thought. We now have 5 in-progress thoughts at the same time.
Then, it runs all of those through Layer 1. Now we have 5 slightly-more-developed in-progress thoughts.
Then, we repeat 58 more times.
Then, we decode all 5 in-progress thoughts into our answers:
Batching already exists, of course! This is what llama.cpp's --parallel flag does, and what the batch_size parameter in HuggingFace's Accelerate does.
It's also common practice to use batching when training a model, for example DeepSpeed's train_batch_size parameter.
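As an example of that ordinary kind of batching, here's what batched generation looks like with HuggingFace's transformers library, using a small model for illustration. This is the usual full-model-in-memory batching, not the disk-streaming kind:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# gpt2 has no pad token; decoder-only models should also pad on the left.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["Is the sky blue?", "Do fish fly?", "Is lava cold?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# One forward pass per token, shared across all three prompts.
outputs = model.generate(**inputs, max_new_tokens=8, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```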
However, those require the entire model to be loaded, either into this machine's memory or distributed across other machines' memory.
And alas, I have only one machine, with relatively little VRAM.
So, it seemed I had a few options:
The last one definitely sounded the easiest.
In the end, it only took 123 lines and one evening. It's weekend-hackathon quality, but it worked!
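The core of it looks roughly like this: the same layer loop as before, but each loaded layer runs against a whole batch of in-progress thoughts before being evicted. This is a simplified sketch building on the earlier stand-ins, not the actual 123 lines:

```python
def forward_batched(thoughts):
    """thoughts has shape (batch_size, HIDDEN_SIZE): one in-progress thought per prompt."""
    for i in range(NUM_LAYERS):
        weights = load_layer(i)                  # pay the disk cost once per layer...
        thoughts = run_layer(weights, thoughts)  # ...and share it across every prompt in the batch
        del weights
    return thoughts

# 500 prompts now share each expensive layer load, instead of re-reading
# all 59 layers from disk separately for every single prompt.
thoughts = np.random.randn(500, HIDDEN_SIZE)  # stand-in for 500 embedded prompts
thoughts = forward_batched(thoughts)
```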
And the results were pretty impressive.
Previously, it took 35.354 seconds per token, running one prompt.
By running 500 prompts at a time on each layer, we got that down to 4.85 seconds per token, a 7x speedup.
More detailed benchmarks:
And the best part is that this works on my little 16gb Mac; I don't have to go out and buy a graphics card or something with a lot more memory.
If you want to do something like this, I recommend looking into FlexGen, which likely does this way better than my modified AirLLM did. It even uses some slick linear programming techniques to determine the best batch sizes.
On top of that, AirLLM leaves some performance on the table: it doesn't load the next layer ahead of time, even if there's room for it. It waits until inferencing is done. If it loaded the next layer while we did inferencing on the current one, we could get those faster times (like 4.85s) at much smaller batch sizes than 500. 15
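For the curious, here's one hypothetical way that overlap could look, using a background thread to prefetch the next layer while the current one computes. It builds on the earlier stand-ins and is not what AirLLM does today:

```python
from concurrent.futures import ThreadPoolExecutor

def forward_batched_prefetch(thoughts):
    """Overlap disk I/O with compute: while layer i runs, layer i+1 is loading."""
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_layer, 0)
        for i in range(NUM_LAYERS):
            weights = pending.result()  # wait for this layer's weights to arrive
            if i + 1 < NUM_LAYERS:
                pending = loader.submit(load_layer, i + 1)  # start loading the next layer now...
            thoughts = run_layer(weights, thoughts)          # ...while we compute on this one
            del weights
    return thoughts
```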
AirLLM also has a block-wise quantization option for 3x speedup, but it doesn't work on Mac and probably won't until bitsandbytes better supports Mac. This looked promising for a speedup, but I actually suspect it won't help us here.
Honestly, this doesn't really change anything. I just added an existing technique to an existing tool so I could do a thing on my Mac.
However, most people don't know this is possible, and that's the main reason I wrote a whole post about this journey: to show that normal computers can run larger LLMs with reasonable throughput for some use cases.
This technique unblocks a lot of interesting uses, like:
These are all latency-agnostic use cases that can use this approach, without needing hundreds or thousands of dollars of GPUs.
Hopefully, it's only a matter of time before libraries better support this use case!
I haven't tried putting any of this on a Raspberry Pi yet, but it would be pretty awesome.
Thanks for reading! I hope you enjoyed this post. It was a wild ride, and I'm glad I get to share it with you all.
Donations and sponsorships are currently paused, but if you like these articles, please Donate to Kākāpō Recovery and let me know! I love those birds, let's save them!
Cheers,
- Evan Ovadia