It seems worse than the 26B A4B in every respect? I would have thought dense models still beat MoE on many benchmarks.
Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.
I think the key thing that depreciates is their models. You train one at crazy cost and 6 months later it’s worth $0. If you ignore that depreciation you look much more profitable.
The SSD would wear out in days while the laptop generates two responses a day. This is like saying you could power your home with AA batteries: yes, technically you could, but in practice it is entirely infeasible.
There is no wear on the SSDs, because the weights are just read, they are not written during inference.
For model training, the requirements are very different, and the training of a big LLM cannot be done with home equipment. On the other hand, inference can be done on almost any PC, even for LLMs with thousands of billions of parameters, just very slowly.
The only problem is that inference becomes limited by SSD read throughput. Most cheap new personal computers available today can read from only 2 SSDs simultaneously (if there are more, they share a read path), typically 1 PCIe 5.0 SSD and 1 PCIe 4.0 SSD. This gives an upper throughput limit of 24 GB/s, with 15 to 20 GB/s achievable in practice.
The speed in tokens/s is then limited by the amount of weights that must be read per inference cycle. The ratio between output tokens and the amount of weights read can be improved by various methods, like batching multiple tasks or using speculative decoding.
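To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes every forward pass streams the full set of active weights from SSD with nothing cached in RAM, and the model size, quantization and tokens-per-pass figures are hypothetical values picked for illustration, not measurements. The 16 + 8 split is the approximate x4 link bandwidth of a PCIe 5.0 and a PCIe 4.0 drive, which is where the 24 GB/s ceiling comes from.

```python
def max_tokens_per_second(read_gb_per_s, active_params_billion,
                          bits_per_weight, tokens_per_pass=1.0):
    """Upper bound on decode speed when each forward pass must stream
    all active weights from SSD (assumes no RAM caching at all)."""
    # Bytes that have to be read from storage for one forward pass.
    bytes_per_pass = active_params_billion * 1e9 * bits_per_weight / 8
    # How many such passes the storage can feed per second.
    passes_per_second = read_gb_per_s * 1e9 / bytes_per_pass
    # Batching or speculative decoding yields more than one token per pass.
    return passes_per_second * tokens_per_pass


# Approximate x4 link bandwidth: PCIe 5.0 (~16 GB/s) + PCIe 4.0 (~8 GB/s).
THEORETICAL_GB_S = 16 + 8
PRACTICAL_GB_S = 15  # low end of the 15 to 20 GB/s range quoted above

# Hypothetical dense model: 1000B parameters at 4 bits per weight.
single = max_tokens_per_second(PRACTICAL_GB_S, 1000, 4)

# Same model, assuming (hypothetically) 4 useful tokens per weight pass
# thanks to batching or accepted speculative tokens.
batched = max_tokens_per_second(PRACTICAL_GB_S, 1000, 4, tokens_per_pass=4)

print(f"single stream: {single:.3f} tokens/s")
print(f"4 tokens per pass: {batched:.3f} tokens/s")
```

Under these assumptions a single stream sits far below one token per second, which is why getting more output tokens out of each pass over the weights matters so much here.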
Faster SSD access improves performance more than RAM does, at least until all of the model is being cached in RAM. So older and cheaper HEDT platforms with lots of PCIe lanes to attach storage to are best for this approach.
My earlier research suggests NVIDIA does not actually cap spikes, it caps the average over short periods of time. So setting the power limit is no guarantee.
While I’m sympathetic to this argument (it would be great if the internet were a safe place), in practice this thinking leads to governments trying to impose legislation that hurts legitimate uses but does little to protect from the long tail of harm. There’s little that can be done about North Korean state sanctioned cybercrime without a great firewall.
If the perpetrators of this hack were caught and in a developed country, they would certainly be prosecuted for their crimes and not get off lightly (especially if any data is actually leaked).
I think states should be able to do better than a ‘great firewall’ to defend their domestic net infrastructure from malicious foreign actors.
But I do think it should be much more states’ responsibility to make their domestic network safe for citizens and businesses and institutions to operate.
I recently set up the 26B A4B model on vLLM on an RTX 3090 (4-bit) after a hiatus from local models. I’m just completely blown away by the speed and quality you can get now for a sub-$1k investment.
I tried first with Qwen but it was unstable and had ridiculously long thinking traces!
The A4B model is blazing fast and the model is super good at general inquiries. Notably worse than Qwen 3.6 for coding tasks but that says more about the Qwen model.
The 31B is surprisingly fast too, for a dense model. Token generation (tg) runs at least twice as fast as it ought to on my machine compared to other 30B models, probably due to the hybrid attention. Ingestion is somewhat slower though.