Just a note that you have to have at least 12GB VRAM for it to be worth even try...

lhl · on July 26, 2023

Just a quick note: for most people (e.g., those wanting to chat with a model in real time), I don't think running locally on a CPU is a very viable option unless you're very patient. On my 16c Ryzen 5950X/64GB DDR4-3800 system, running llama-2-70b-chat (q4_K_M) on llama.cpp (eb542d3) with a 100-token test (life's too short to try max context), I got 1.25 tokens/second (~1 word/second) of output.

Compiling with cuBLAS and running w/ `-ngl 0` (~400MB of VRAM usage, no layers loaded) makes no perf difference. The max I can load on a headless 24GB 4090 is 45/83 layers (running `-ngl 45 --low-vram`), which brings speeds up to 2.5 t/s. A little less painful, but still not super pleasant. For reference, people have reported performance of 12-15 t/s with 2x4090s w/ exllama (GPTQ). People are using a 14,20 split and are able to load a full (NTK RoPE scaled) 16K context into 48GB of VRAM.
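As a rough sanity check on the 45/83 figure, here is a minimal sketch. It assumes all layers are the same size, that the quantized model takes about 40GB in total (the q4 figure quoted elsewhere in this thread), and that about 2GB of VRAM is held back for scratch buffers and context. The function name `layers_that_fit` and the reserve value are hypothetical; this is not something llama.cpp exposes.

```python
def layers_that_fit(vram_gb: float, model_gb: float, n_layers: int,
                    reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb is an assumed allowance for scratch buffers and context.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb / per_layer_gb))


# 24GB card, ~40GB model, 83 layers (numbers taken from this thread)
print(layers_that_fit(24, 40, 83))
```

With these assumed inputs the estimate comes out at 45 layers, in line with what was observed, but the reserve is a guess and real layers are not all the same size.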

ErneX · on July 25, 2023

Apple Silicon Macs might not have great GPUs, but they do have unified memory. I need to try this on mine; I have 96GB of RAM on my M2 Max.

diffeomorphism · on July 26, 2023

What does "unified" actually mean and how much would that help? It is still off-the-shelf LPDDR5-6400, just with a better interconnect (like a PS5).

How does this compare to non-unified DDR5 or HBM2e as on Nvidia A100 cards?

lhl · on July 26, 2023

The benefits are primarily price - 96GB of VRAM would be 4x3090/4090 (~$6K) or 2xA6000 (~$8-14K) cards (also, looks like you can buy an 80GB A100 PCIe for about $15K atm). While Apple is using LPDDR5, it is also running a lot more channels than comparable PC hardware. The M2 has 100GB/s, M2 Pro 200GB/s, M2 Max 400GB/s, and M2 Ultra is 800GB/s (8 channel) of memory bandwidth. The Nvidia cards are about 900GB/s-1TB/s (A100 PCIe gets up to 1.5TB/s).

In practice, on quantized versions of the larger open LLMs, an M2 Ultra can currently run inference about 2-4X faster than the best PC CPUs I've seen (mega Epyc systems), but also about 2-4X slower than 2x4090s.
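To see why the bandwidth numbers matter so much, here is a back-of-the-envelope sketch. It assumes token generation is memory-bandwidth bound, i.e. every generated token has to read all of the weights once, so bandwidth divided by model size gives a ceiling on tokens/second. That is a common rule of thumb, not a measured result, and the function names are made up for illustration. The 40GB model size is the approximate q4 70B figure quoted elsewhere in this thread.

```python
def ddr_bandwidth_gb_s(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth: transfers/s x bytes per transfer x channels."""
    return mt_per_s * bus_bytes * channels / 1000


def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Ceiling on decode speed if each token must read all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 40  # assumed size of a 4-bit 70B model

for name, bw in [
    ("dual-channel DDR4-3800", ddr_bandwidth_gb_s(3800, 2)),
    ("M2 Max", 400),
    ("M2 Ultra", 800),
]:
    ceiling = max_tokens_per_second(bw, MODEL_GB)
    print(f"{name}: {bw:.0f} GB/s -> at most {ceiling:.1f} tokens/s")
```

These are ceilings only. Dual-channel DDR4-3800 works out to about 60.8GB/s, or roughly 1.5 tokens/s at best, which is in the same ballpark as the 1.25 t/s measured on the 5950X.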

diffeomorphism · on July 26, 2023

That is useful info, but still does not quite address the question.

The question was how memory type, memory amount and bandwidth factor into actual performance. So let me rephrase: Given a budget of $X, what performance/limitations should you expect with

- 256GB of non-unified DDR5 in a PC, just CPU

- 128GB of DDR5 for an APU

- 96GB of unified DDR5

- Whatever Nvidia will sell you for $X.

An answer of "just compare a single memory bandwidth number" seems a bit short. Sure, more bandwidth helps, but is half as much RAM at double bandwidth better or worse?

ErneX · on July 26, 2023

No idea, I just said I wanted to try this out and see how it performs.

Doesn’t the amount of VRAM limit the size of the model you can load? I’m not talking about training, just inference. I also pointed out these are not the greatest GPUs available, just that the advantage they have is being able to address more memory, since on those machines it is a shared block between system and GPU.

thejosh · on July 26, 2023

It's a term used to justify non replaceable parts :-).

NoMoreNicksLeft · on July 25, 2023

Trying to figure out what hardware to convince my boss to spend on... if we were to get one of the A6000/48gb cards, will that see significant performance improvements over just a 4090/24gb? The primary limitation is vram, is it not?

lolinder · on July 25, 2023

VRAM is what gets you up to the larger model sizes, and 24GB isn't enough to load the full 70B even at 4 bits, you need at least 35 and some extra for the context. So it depends a lot on what you want to do—fine tuning will take even more as I understand it.
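The 35GB figure is just parameter count times bits per weight. A minimal sketch of that arithmetic (weights only, ignoring context and runtime overhead; `weight_memory_gb` is a name made up for this example):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (4, 8, 16):
    print(f"70B at {bits} bits: {weight_memory_gb(70e9, bits):.0f} GB")
```

By the same arithmetic, an unquantized 16-bit 70B model needs around 140GB for the weights alone.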

The card's speed will affect your performance, but I don't know enough about different graphics cards to tell you specifics.

ycombmehair · on July 26, 2023

How would an APU, such as a 5700G with up to 128GB of system RAM, perform when allocating it as VRAM? Is this a cost-effective way of running this on a budget?

NoMoreNicksLeft · on July 26, 2023

Well, 48gb is better than nothing at least. And it has the potential (if we get the build right) to drop a second A6000 card into it with the nvlink module (I think this does allow you to effectively have 96gb) later.

cjbprime · on July 25, 2023

You might consider getting a Mac Studio (with as much RAM as you can afford up to 192GB) instead, since 192GB is more (unified) memory than you're going to easily get to with GPUs.

abhibeckert · on July 26, 2023

This. The main system memory on a Mac Studio is GPU memory and there's a lot of it.

It also has the Neural Engine, which is specifically designed for this type of work - most software isn't designed to take advantage of that yet, but presumably it will soon.

lhl · on July 26, 2023

While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:

* I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - while sure, this might be fixed in the future, it's been an issue since Metal support was added, and is a significant problem if you are actually trying to use it for inferencing. With 192GB of memory, you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experiences. Note that at that point, the limited memory bandwidth will be a big factor.

* If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year long open bugs in PyTorch[1], and most major LLM libs like deepspeed, bitsandbytes, etc don't have Apple Silicon support[2][3].

You can see similar patterns w/ Stable Diffusion support [4][5] - support lagging by months, lots of problems and poor performance with inference, much less fine tuning. You can apply this to basically any ML application you want (srt, tts, video, etc)

Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, recommending a Mac for anyone getting started w/ ML workloads is a bad take. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)

[1] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3A...

[2] https://github.com/microsoft/DeepSpeed/issues/1580

[3] https://github.com/TimDettmers/bitsandbytes/issues/485

[4] https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...

[5] https://forums.macrumors.com/threads/ai-generated-art-stable...

cjbprime · on July 30, 2023

Just a note to say thank you for this detailed reply! I did not know these things, and am getting a Mac Studio of similar spec for work soon (for reasons unrelated to AI) and it's helpful to know what to expect about its ML capabilities.

(Still, how much would you have to spend to get 192GB of GPU RAM available to you, fully purchased? The 192GB Mac Studio M2 Ultra is around $5800. Is that the difference between sort-of-GPU speeds and falling down to CPU speeds, if you want to run e.g. the best, largest open source models available?)

I suppose even "falling down to CPU speeds" isn't really plausible -- I think you'd find it hard to put 192GB DDR5 (at least without falling to speeds below DDR4) in any fast, modern desktop because they all have two channels of DDR5.

lhl · on Aug 7, 2023

Those are very different questions...

If you want to simply run inference or do QLoRA fine tunes of "the best, largest open source models", e.g. the llama2-70b models, you can do so with 2 x RTX 3090 24GB (~$600 each used), so about $1200 for the GPUs and 48GB of VRAM (set to PL 300W, so 600W while inferencing) - a q4 version of llama2-70b takes about 38-40GB of memory + kvcache.
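For a sense of how big the kvcache part is, here is a rough sketch. The defaults are assumptions taken from the published Llama-2-70B config (80 layers, 8 key/value heads via grouped-query attention, head dimension 128) with 16-bit cache entries; a given loader may store the cache differently, and `kv_cache_gb` is a name made up for this example.

```python
def kv_cache_gb(n_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate key/value cache size in GB for a given context length."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token_bytes / 1e9


for ctx in (2048, 4096, 16384):
    print(f"{ctx} tokens: {kv_cache_gb(ctx):.2f} GB")
```

Under these assumptions a 4096-token context adds a bit over 1GB on top of the weights, which is why 38-40GB of weights still fits comfortably in 48GB.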

If you want 192GB of VRAM, your cheapest realistic option is probably going to be 4 x A6000's (~$16,000) - you will need to have a chassis that will provide adequate power and cooling (1200W for the GPUs). I'd personally suggest that anyone looking to buy that kind of hardware have a fairly good idea of what they're going to use it for beforehand.

I'm not sure what exactly you're asking about with regards to memory, but for workstations, the Xeon W-3400's have 8 channels of DDR5-4800 (the W5-3425 has a $1200 list price) and the upcoming Threadripper Pro 7000s will likely have similar memory support (or you can get an EPYC 9124 for ~$1200 now if you want 12 channels of DDR5).

thejosh · on July 26, 2023

Would it be worthwhile just using "cloud GPUs" (like the providers who rent out GPUs, not the overpriced AWS stuff) until the next generation comes out, then using that?

flangola7 · on July 25, 2023

What is necessary to run 70B on CPU without quantization?

ewokone · on July 26, 2023

Bump. interested as well.

creata · on July 26, 2023

Running just some of the layers on the GPU can still make things much faster, though.

dc443 · on July 25, 2023

I have 2x 3090; do you know if it's feasible to use that 48GB total for running this?

eurekin · on July 25, 2023

Yes, it runs totally fine. I ran it in Oobabooga/text generation web UI. The nice thing about it is that it auto-downloads all necessary GPU binaries on its own and creates an isolated conda env. I asked the same questions on the official 70b demo and got the same answers. I even got better answers with ooba, since the demo cuts text off early.

Oobabooga: https://github.com/oobabooga/text-generation-webui

Model: TheBloke_Llama-2-70B-chat-GPTQ from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

ExLlama_HF loader gpu split 20,22, context size 2048

on the Chat Settings tab, choose Instruction template tab and pick Llama-v2 from the instruction template dropdown

Demo: https://huggingface.co/blog/llama2#demo

zakki · on July 25, 2023

Are there any specific settings to make 2x3090 work together?

eurekin · on July 26, 2023

Not really? I just put those cards in separate PCIe slots, and ExLlama_HF handles spreading the load internally. No NVLink bridge in particular. I use the "20,22" memory split so that the card driving the display has some room left for the framebuffer.

vid · on July 26, 2023

Do you mean you don't use NVLink or just use one that works? I am under the impression it is being phased out ("PCIe 5 is fast enough") and some kits don't use it.

eurekin · on July 27, 2023

I don't use NVLink

kwerk · on July 26, 2023

Interested in this too

olavfosse · on July 26, 2023

I'm very curious what your other components are and how you managed to fit 2 3090s in one PC.

esperent · on July 26, 2023

Does that mean you have to run it on the CPU? Or can you use the GPU with system RAM?