undefined

points

[-]

This confused me at first as well.. inactive experts skip compute, but weights are sill loaded. So memory does not shrink at all.

I found this visualisation helpful - https://vectree.io/c/sparse-activation-patterns-and-memory-e...

by IceWreck1 days ago|

prev|

[-]

It does if you use an inference engine where you can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35 billion param MoE in let's say 12 GB VRAM GPU + 16 gigs of memory.

by Yukonv1 days ago|

parent|

[-]

With that you are taking a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD Read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated different experts need to be consulted resulting in a lot of I/O churn. So while feasible it's only great for batch jobs not interactive usage.

by IceWreck23 hours ago|

parent|

[-]

> So while feasible it's only great for batch jobs not interactive usage.

I mean yeah true but depends on how big the model is. The example I gave (Qwen 3.5 35BA3B) was fitting a 35B Q4 K_M (say 20 GB in size) model in 12 GB VRAM. With a 4070Ti + high speed 32 GB DDR5 ram you can easily get 700 token/sec prompt processing and 55-60 token/sec generation which is quite fast.

On the other hand if I try to fit a 120B model in 96 GB of DDR5 + the same 12 GB VRAM I get 2-5 token/sec generation.

by zozbot23423 hours ago|

parent|

[-]

Your 120B model likely has way more active parameters, so it can probably only fit a few shared layers in the VRAM for your dGPU. You might be better off running that model on a unified memory platform, slower VRAM but a lot more of it.

by IceWreck57 minutes ago|

parent|

[-]

Yep, I understand I was giving an example to the person I was replying to.

by zozbot23423 hours ago|

parent|

prev|

[-]

10 tok/s is quite fine for chatting, though less so for interaction with agentic workloads. So the technique itself is still worthwhile for running a huge model locally.

by charcircuit1 days ago|

prev|

[-]

You never need to have all weights in memory. You can swap them in from RAM, disk, the network, etc. MOE reduces the amount of data that will need to be swapped in for the next forward pass.

by martinald23 hours ago|

parent|

[-]

Yes you're right technically, but in reality you'd be swapping them the (vast?) majority in and out per inference request so would create an enormous bottleneck for the use case the author is using for.

by zozbot23423 hours ago|

parent|

[-]

With unified memory, reading from RAM to GPU compute buffer is not that painful, and you can use partial RAM caching to minimize the impact of other kinds of swapping.

by mikkupikku11 hours ago|

parent|

[-]

In practical terms, is this kind of architecture available to consumers except through Apple?

by the_pwner22410 hours ago|

parent|

[-]

AMD Strix Halo. Available in the Framework desktop, various mini PCs, and the Asus Rog Flow Z13 "gaming tablet." The Z13 is still at $2700 for 128 GB which is an incredible deal with today's RAM prices.

There's also the Nvidia DGX Spark.

by charcircuit21 hours ago|

parent|

prev|

[-]

You don't have to only have the experts being actively used in VRAM. You can load as many weights as will fit. If there is a "cache miss" you have to pay the price to swap in the weights, but if there is a hit you don't.