christina97 · 2026-06-03T20:06:58 1780517218

It seems worse in all aspects than the 26B A4B? I would have thought dense models still beat MoE on many benchmarks.

Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.

christina97 · 2026-05-27T21:18:17 1779916697

I think the key thing that depreciates is all their models. You train one at crazy cost and 6 months later it’s worth $0. If you ignore that depreciation you look much more profitable.

ACCount37 · 2026-05-27T21:26:29 1779917189

Model inference compute outweighs training compute by 10:1 or more for frontier LLMs. "LLM depreciation" is an expense, but not a dealbreaker.

christina97 · 2026-05-26T18:56:28 1779821788

The SSD would wear out in days while the laptop generates two responses a day. This is like saying you could power your home with AA batteries: yes, technically you could, but in practice it is entirely infeasible.

adrian_b · 2026-05-26T22:27:45 1779834465

There is no wear on the SSDs, because the weights are just read, they are not written during inference.

For model training, the requirements are very different, and the training of a big LLM cannot be done with home equipment. On the other hand, inference can be done on almost any PC, even for LLMs with thousands of billions of parameters, just very slowly.

The only problem is that inference becomes limited by the SSD read throughput. Most of the cheap new personal computers available today can read from only 2 SSDs simultaneously (if there are more, they share a read path), which are typically 1 PCIe 5.0 SSD and 1 PCIe 4.0 SSD. This has an upper throughput limit of 24 GB/s, with 15 to 20 GB/s achievable in practice.

Then the speed in tokens/s is limited by the amount of weights that must be read per inference cycle. The ratio between output tokens and the amount of weights that must be read can be improved by various methods, like batching multiple tasks or using speculative decoding.
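As a rough back-of-envelope sketch of that limit: assuming every weight needed for a forward pass is streamed from SSD exactly once, nothing is cached in RAM, and compute is never the bottleneck, the token rate is just read throughput divided by bytes read per pass. The function name, the 1000B-parameter dense model, and the 4-bit quantization are illustrative assumptions, not figures from this thread; only the 15 GB/s throughput comes from the estimate above.

```python
def tokens_per_second(read_gb_per_s, params_read_billion, bits_per_weight, tokens_per_pass=1):
    """Upper bound on tokens/s when inference is limited by SSD read throughput.

    Assumes (hypothetically) that each forward pass streams all the weights
    it needs from SSD exactly once, with no RAM caching.
    tokens_per_pass models batching or accepted speculative tokens: several
    output tokens share one read of the weights.
    """
    bytes_per_pass = params_read_billion * 1e9 * bits_per_weight / 8
    passes_per_second = read_gb_per_s * 1e9 / bytes_per_pass
    return passes_per_second * tokens_per_pass


# Illustrative numbers only: a 1000B-parameter dense model at 4 bits per weight
single = tokens_per_second(15, 1000, 4)
batched = tokens_per_second(15, 1000, 4, tokens_per_pass=8)
print(f"{single:.3f} tokens/s single, {batched:.3f} tokens/s with 8 tokens per pass")
```

With these assumed numbers one pass reads 500 GB, so a single stream gets roughly one token every 33 seconds, and sharing each read across 8 tokens scales that linearly. A MoE model would plug in only the parameters actually read per token.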

jurgenburgen · 2026-05-27T06:30:09 1779863409

Does more RAM increase performance? This approach sounds like it could eventually be fast enough for local use as hardware and models improve.

zozbot234 · 2026-05-27T07:46:34 1779867994

Faster SSD access improves performance more than RAM does, at least until all of the model is being cached in RAM. So older and cheaper HEDT platforms with lots of PCIe lanes to attach storage to are best for this approach.
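A toy model of that trade-off, under loudly stated assumptions: a dense model whose weights are all read once per pass, a fixed slice of them pinned in RAM, and SSD and RAM reads that do not overlap. The 500 GB model size, the 80 GB/s RAM bandwidth, and the function name are hypothetical values chosen for illustration.

```python
def seconds_per_pass(weights_gb, ssd_gb_per_s, ram_gb_per_s, ram_cache_gb):
    """Time to read all weights once, with part of them cached in RAM.

    Hypothetical model: cached weights are read at RAM speed, the rest is
    streamed from SSD, and the two reads happen one after the other.
    """
    cached_gb = min(ram_cache_gb, weights_gb)
    streamed_gb = weights_gb - cached_gb
    return streamed_gb / ssd_gb_per_s + cached_gb / ram_gb_per_s


# Illustrative 500 GB of weights, 15 GB/s SSD, 80 GB/s RAM
baseline = seconds_per_pass(500, 15, 80, ram_cache_gb=0)
more_ram = seconds_per_pass(500, 15, 80, ram_cache_gb=128)
faster_ssd = seconds_per_pass(500, 30, 80, ram_cache_gb=0)
all_cached = seconds_per_pass(500, 15, 80, ram_cache_gb=500)

for name, t in [("baseline", baseline), ("+128 GB RAM", more_ram),
                ("2x SSD throughput", faster_ssd), ("fully cached", all_cached)]:
    print(f"{name}: {t:.1f} s per pass")
```

In this sketch, doubling SSD throughput helps more than adding 128 GB of RAM cache, and RAM only wins once the whole model fits.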

jyounker · 2026-05-26T19:14:58 1779822898

Weights are write-once data.

christina97 · 2026-05-21T22:06:51 1779401211

My earlier research suggests NVIDIA does not actually cap spikes; it caps the average over short periods of time. So setting the power limit is no guarantee.
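To illustrate what an average-based cap would mean, here is a toy sketch. It is not NVIDIA's actual algorithm and uses no driver API; the trace, the 300 W limit, and the 4-sample window are made-up values.

```python
def max_window_average(samples_w, window):
    """Largest mean over any run of `window` consecutive power samples."""
    return max(
        sum(samples_w[i:i + window]) / window
        for i in range(len(samples_w) - window + 1)
    )


# Hypothetical trace in watts: brief 500 W spikes between 200 W stretches
trace = [200, 200, 200, 500, 200, 200, 200, 500, 200, 200]
limit_w = 300
window = 4

worst_average = max_window_average(trace, window)
print(f"worst {window}-sample average: {worst_average} W, peak: {max(trace)} W")
print("average within limit:", worst_average <= limit_w)
print("peak within limit:", max(trace) <= limit_w)
```

Every windowed average in this trace stays under the limit even though individual samples exceed it, which is why a PSU sized only for the configured limit could still see higher transient draw.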

christina97 · 2026-05-09T03:45:37 1778298337

But it’s okay to be down when the whole internet is down.

christina97 · 2026-05-08T14:21:48 1778250108

While I’m sympathetic to this argument (it would be great if the internet were a safe place), in practice this thinking leads to governments trying to impose legislation that hurts legitimate uses but does little to protect against the long tail of harm. There’s little that can be done about North Korean state-sanctioned cybercrime without a great firewall.

If the perpetrators of this hack were caught and were in a developed country, they would certainly be prosecuted for their crimes and would not get off lightly (especially if any data is actually leaked).

jameshart · 2026-05-08T14:32:19 1778250739

I think states should be able to do better than a ‘great firewall’ to defend their domestic net infrastructure from malicious foreign actors.

But I do think it should be much more states’ responsibility to make their domestic network safe for citizens and businesses and institutions to operate.

christina97 · 2026-05-05T17:31:51 1778002311

I recently set up the 26B A4B model on vLLM on an RTX 3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for a sub-$1k investment.

I tried first with Qwen but it was unstable and had ridiculously long thinking traces!

aimxhaisse · 2026-05-05T20:03:17 1778011397

It even fits on a 3060 with turboquant / Q4 at decent speed (40T/s) for ~200$ (:

2ndorderthought · 2026-05-05T19:20:27 1778008827

Some of the early quants for Qwen 3.6 were broken. It's still finicky, but with a little hand-holding it's crazy.

Local models are the future, it's awesome.

jszymborski · 2026-05-05T17:34:22 1778002462

The A4B model is blazing fast and the model is super good at general inquiries. Notably worse than Qwen 3.6 for coding tasks but that says more about the Qwen model.

maille · 2026-05-05T22:19:33 1778019573

Bad at coding, but would it be good at code review?

avadodin · 2026-05-06T09:47:05 1778060825

Good compared to what? Nothing? Probably better.

moffkalast · 2026-05-05T20:55:42 1778014542

The 31B is surprisingly fast too, for a dense model. Token generation runs at least twice as fast as it ought to on my machine compared to other 30B models, probably due to the hybrid attention. Ingestion is somewhat slower though.

christina97 · 2026-05-04T17:53:02 1777917182

You’re absolutely right!

christina97 · 2026-05-01T23:44:19 1777679059

By and large cooling, just like a data center.

christina97 · 2026-05-01T03:16:25 1777605385

If it’s so easy then why don’t we have a high quality classifier?