undefined

points

[-]

The 3B active is small enough that it's decently fast even with experts offloaded to system memory. Any PC with a modern (>=8 GB) GPU and sufficient system memory (at least ~24 GB) will be able to run it okay; I'm pretty happy with just a 7800 XT and DDR4. If you want faster inference you could probably squeeze it into a 24 GB GPU (3090/4090 or 7900 XTX) but 32 GB would be a lot more comfortable (5090 or Radeon Pro).

122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.

by canpan16 hours ago|

prev|

[-]

Any good gaming pc can run the 35b-a3 model. Llama cpp with ram offloading. A high end gaming PC can run it at higher speeds. For your 122b, you need a lot of memory, which is expensive now. And it will be much slower as you need to use mostly system ram.

by bigyabai15 hours ago|

parent|

[-]

Seconding this. You can get A3B/A4B models to run with 10+ tok/sec on a modern 6/8GB GPU with 32k context if you optimize things well. The cheapest way to run this model at larger contexts is probably a 12gb RTX 3060.

by mildred59316 hours ago|

prev|

[-]

I can run this on an AMD Framework laptop. A Ryzen 7 (I dont have Ryzen AI, just Ryzen 7 7840U) with 32+48 GB DDR. The Ryzen unified memory is enough, I get 26GB of VRAM at least.

Fedora 43 and LM Studio with Vulkan llama.cpp

by terramex16 hours ago|

prev|

[-]

I run Gemma 4 26B-A4B with 256k context (maximum) on Radeon 9070XT 16GB VRAM + 64GB RAM with partial GPU offload (with recommended LMStudio settings) at very reasonable 35 tokens per second, this model is similiar in size so I expect similar performance.

by ru55216 hours ago|

prev|

[-]

You won't like it, but the answer is Apple. The reason is the unified memory. The GPU can access all 32gb, 64gb, 128gb, 256gb, etc. of RAM.

An easy way (napkin math) to know if you can run a model based on it's parameter size is to consider the parameter size as GB that need to fit in GPU RAM. 35B model needs atleast 35gb of GPU RAM. This is a very simplified way of looking at it and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.

by samtheprogram16 hours ago|

parent|

[-]

That estimate doesn't account for context, which is very important for tool use and coding.

I used this napkin math for image generation, since the context (prompts) were so small, but I think it's misleading at best for most uses.

by sliken14 hours ago|

parent|

prev|

[-]

> You won't like it, but the answer is Apple.

Or strix halo.

Seems rather over simplified.

The different levels of quants, for Qwen3.6 it's 10GB to 38.5GB.

Qwen supports a context length of 262,144 natively, but can be extended to 1,010,000 and of course the context length can always be shortened.

Just use one of the calculators and you'll get much more useful number.

by 38362936489 hours ago|

parent|

[-]

What Strix Halo system has unified memory? A quick google says it's just a static vram allocation in ram, not that CPU and GPU can actively share memory at runtime

by sliken3 hours ago|

parent|

[-]

All. Keep in mind strix != strix halo.

You can get tablets, laptops, and desktops. I think windows is more limited and might require static allocation of video memory, not because it's a separate pool, just because windows isn't as flexible.

With linux you can just select the lowest number in bios (usually 256 or 512MB) then let linux balance the needs of the CPU/GPU. So you could easily run a model that requires 96GB or more.

by ac293 hours ago|

parent|

prev|

[-]

> What Strix Halo system has unified memory?

All of them. The static VRAM allocation is tiny (512MB), most of the memory is unified

by rhdunn16 hours ago|

prev|

[-]

The Q5 quantization (26.6GB) should easily run on a 32GB 5090. The Q4 (22.4GB) should fit on a 24GB 4090, but you may need to drop it down to Q3 (16.8GB) when factoring in the context.

You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.

You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.

The more you run in RAM the slower the inference.

by 16 hours ago|

parent|

[-]

deleted

by bildung15 hours ago|

prev|

[-]

I currently run the qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. Obviously much slower than hosted models. I get ~ 20t/s with empty context and am down to about 14t/s with 100k of context filled.

No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.