Just a note that you have to have at least 12GB VRAM for it to be worth even trying to use your GPU for LLaMA 2.
The 7B model quantized to 4 bits can fit in 8GB VRAM with room for the context, but is pretty useless for getting good results in my experience. 13B is better but still not anything near as good as the 70B, which would require >35GB VRAM to use at 4 bit quantization.
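For anyone wanting to sanity-check these numbers, here is a rough back-of-the-envelope sketch. It assumes exactly 4 bits per weight and counts only the weights (no context, no runtime overhead); the function name is made up for illustration. Real 4-bit formats store extra scale data, so actual files come out somewhat larger.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Context (KV cache) and runtime buffers are NOT included.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


for size in (7, 13, 70):
    print(f"{size}B @ 4-bit: ~{weight_gb(size, 4):.1f} GB of weights")
```

That is where the ">35GB" figure for 70B comes from: 70 billion weights at half a byte each, before any room for context.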
My solution for playing with this was just to upgrade my PC's RAM to 64GB. It's slower than the GPU, but it was way cheaper and I can run the 70B model easily.
Just a quick note: for most people (e.g., anyone wanting to chat with a model in real time), I don't think running locally on a CPU is a very viable option unless you're very patient. On my 16c Ryzen 5950X/64GB DDR4-3800 system, running llama-2-70b-chat (q4_K_M) on llama.cpp (eb542d3) with a 100-token test (life's too short to try max context), I got 1.25 tokens/second (~1 word/second) of output.
Compiling with cuBLAS and running w/ `-ngl 0` (~400MB of VRAM usage, no layers loaded) makes no perf difference. The max I can load on a headless 24GB 4090 is 45/83 layers (running `-ngl 45 --low-vram`), which brings speeds up to 2.5 t/s. A little less painful, but still not super pleasant. For reference, people have reported performance of 12-15 t/s with 2x4090s w/ exllama (GPTQ). People are using a 14,20 split and are able to load a full (NTK RoPE scaled) 16K context into 48GB of VRAM.
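A rough way to guess how many layers will fit before trying it, as a sketch only: it assumes every layer is the same size, that the q4 70B file is about 40GB (roughly what people report), and that about 2GB of VRAM has to stay free for context and scratch buffers. The 2GB reserve is a guess for illustration, not a measured number, and the function name is hypothetical.

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserved_gb: float = 0.0) -> int:
    """Estimate how many layers can be offloaded to a GPU.

    Assumptions: all layers are equal in size, and `reserved_gb` of VRAM
    is kept free for context and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserved_gb)
    return min(n_layers, int(usable_gb // per_layer_gb))


# ~40GB model, 83 layers as reported by llama.cpp, 24GB card, 2GB held back
print(layers_that_fit(model_gb=40, n_layers=83, vram_gb=24, reserved_gb=2))
```

With those inputs the estimate lands in the same ballpark as the 45 layers above, but the reserve figure is tuned by hand, so treat it as a starting point for `-ngl` and adjust from there.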
The benefits are primarily price - 96GB of VRAM would be 4x3090/4090 (~$6K) or 2xA6000 (~$8-14K) cards (also, looks like you can buy an 80GB A100 PCIe for about $15K atm). While Apple is using LPDDR5, it is also running a lot more channels than comparable PC hardware. The M2 has 100GB/s, M2 Pro 200GB/s, M2 Max 400GB/s, and M2 Ultra is 800GB/s (8 channel) of memory bandwidth. The Nvidia cards are about 900GB/s-1TB/s (A100 PCIe gets up to 1.5TB/s).
In practice, on quantized versions of the larger open LLMs, an M2 Ultra can currently run inference about 2-4X faster than the best PC CPUs I've seen (mega Epyc systems), but also about 2-4X slower than 2x4090s.
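To see why bandwidth matters so much, a common rule of thumb (an approximation, not something measured here) is that generating one token means reading every weight once, so memory bandwidth divided by model size gives a ceiling on tokens/second. The sketch below assumes a ~40GB q4 70B model; the bandwidth figures are the ones quoted above, plus the theoretical peak for dual-channel DDR4-3800 (2 channels x 8 bytes x 3800 MT/s). It ignores compute, context, and any split across devices.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 40  # assumption: approximate size of a q4 70B model

systems = {
    "Dual-channel DDR4-3800": 2 * 8 * 3800 / 1000,  # 60.8 GB/s theoretical
    "M2": 100,
    "M2 Pro": 200,
    "M2 Max": 400,
    "M2 Ultra": 800,
    "Nvidia card (~1TB/s)": 1000,
}

for name, bandwidth in systems.items():
    ceiling = max_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

Real numbers come in under these ceilings, but the ordering tends to hold: the DDR4 ceiling of about 1.5 t/s sits just above the 1.25 t/s reported for the 5950X earlier in the thread.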
That is useful info, but still does not quite address the question.
The question was how memory type, memory amount and bandwidth factor into actual performance. So let me rephrase: Given a budget of $X, what performance/limitations should you expect with
- 256GB of non-unified DDR5 in a PC, just CPU
- 128GB of DDR5 for an APU
- 96GB of unified DDR5
- Whatever Nvidia will sell you for $X.
An answer of "just compare a single memory bandwidth number" seems a bit short. Sure, more bandwidth helps, but is half as much RAM at double bandwidth better or worse?
No idea, I just said I wanted to try this out and see how it performs.
Doesn’t the amount of VRAM limit the size of the model you can load? I’m not talking about training, just inference. I also pointed out these are not the greatest GPUs available, just that the advantage they have is being able to address more memory, since on those machines it is a shared block between system and GPU.
Trying to figure out what hardware to convince my boss to spend on... if we were to get one of the A6000/48gb cards, will that see significant performance improvements over just a 4090/24gb? The primary limitation is vram, is it not?
VRAM is what gets you up to the larger model sizes, and 24GB isn't enough to load the full 70B even at 4 bits, you need at least 35 and some extra for the context. So it depends a lot on what you want to do—fine tuning will take even more as I understand it.
The card's speed will affect your performance, but I don't know enough about different graphics cards to tell you specifics.
How would an APU, such as a 5700G with up to 128GB of system RAM, perform when allocating it as VRAM? Is this a cost-effective way of running this on a budget?
Well, 48gb is better than nothing at least. And it has the potential (if we get the build right) to drop a second A6000 card into it with the nvlink module (I think this does allow you to effectively have 96gb) later.
You might consider getting a Mac Studio (with as much RAM as you can afford up to 192GB) instead, since 192GB is more (unified) memory than you're going to easily get to with GPUs.
This. The main system memory on a Mac Studio is GPU memory and there's a lot of it.
It also has the Neural Engine, which is specifically designed for this type of work - most software isn't designed to take advantage of that yet, but presumably it will soon.
While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:
* I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - while sure, this might be fixed in the future, it's been an issue since Metal support was added, and is a significant problem if you are actually trying to use it for inferencing. With 192GB of memory, you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experiences. Note that at that point, the limited memory bandwidth will be a big factor.
* If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year long open bugs in PyTorch[1], and most major LLM libs like deepspeed, bitsandbytes, etc don't have Apple Silicon support[2][3].
You can see similar patterns w/ Stable Diffusion support [4][5] - support lagging by months, lots of problems and poor performance with inference, much less fine tuning. You can apply this to basically any ML application you want (srt, tts, video, etc)
Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, recommending a Mac for anyone getting started w/ ML workloads is a bad take. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)
Just a note to say thank you for this detailed reply! I did not know these things, and am getting a Mac Studio of similar spec for work soon (for reasons unrelated to AI) and it's helpful to know what to expect about its ML capabilities.
Still, how much would you have to spend to get 192GB of GPU RAM available to you, fully purchased? The 192GB Mac Studio M2 Ultra is around $5800. Is that the difference between sort-of-GPU speeds and falling down to CPU speeds, if you want to run e.g. the best, largest open source models available?
I suppose even "falling down to CPU speeds" isn't really plausible -- I think you'd find it hard to put 192GB DDR5 (at least without falling to speeds below DDR4) in any fast, modern desktop because they all have two channels of DDR5.
If you want to simply run inference or do QLoRA fine tunes of "the best, largest open source models", e.g. the llama2-70b models, you can do so with 2 x RTX 3090 24GB (~$600 each used): about $1200 for the GPUs, with 48GB of VRAM (set to PL 300W, so 600W while inferencing). The q4 version of llama2-70b takes about 38-40GB of memory + kvcache.
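For a rough idea of the kvcache part, here is a sketch using the published Llama 2 70B shape (80 layers, 8 key/value heads thanks to grouped-query attention, head dimension 128). Those numbers come from the model config, not from this thread, and the estimate assumes an unquantized fp16 cache with no extra overhead from the inference library.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    The leading 2 accounts for one key and one value tensor per layer.
    Assumption: fp16 cache (2 bytes per value) unless stated otherwise.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 1e9


# Llama 2 70B: 80 layers, 8 KV heads, head_dim 128
for ctx in (4096, 16384):
    print(f"context {ctx}: ~{kv_cache_gb(80, 8, 128, ctx):.2f} GB")
```

So on top of the 38-40GB of weights, a full 4K context needs a bit over 1GB, and a 16K context a bit over 5GB, which is why 48GB is workable for 70B at q4.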
If you want 192GB of VRAM, your cheapest realistic option is probably going to be 4 x A6000's (~$16,000) - you will need to have a chassis that will provide adequate power and cooling (1200W for the GPUs). I'd personally suggest that anyone looking to buy that kind of hardware have a fairly good idea of what they're going to use it for beforehand.
I'm not sure what exactly you're asking about with regards to memory, but for workstations, the Xeon W-3400's have 8 channels of DDR5-4800 (the W5-3425 has a $1200 list price) and the upcoming Threadripper Pro 7000s will likely have similar memory support (or you can get an EPYC 9124 for ~$1200 now if you want 12 channels of DDR5).
Would it be worthwhile just using "cloud GPUs" (like the providers who rent out GPUs, not the overpriced AWS stuff) until the next generation comes out, then using that?
Yes, it runs totally fine. I ran it in Oobabooga/text generation web UI. The nice thing about it is that it autodownloads all the necessary GPU binaries on its own and creates an isolated conda env. I asked the same questions on the official 70b demo and got the same answers. I even got better answers with ooba, since the demo cuts text off early.
Not really? I just got those cards in separate PCI slots and the Exllama_hf handles spreading the load internally. No NVLink bridge in particular. I use the "20,22" memory split so that the display card has some room for the framebuffer to handle display
Do you mean you don't use NVLink or just use one that works? I am under the impression it is being phased out ("PCIe 5 is fast enough") and some kits don't use it.