Personally, I would always max out the RAM you can fit into your budget. You might get lower bandwidth (= slower generation) than you do on a Mac if you choose a Strix Halo or DGX Spark, but there are always new tweaks being discovered to speed things up. That being said, with 32GB you should be able to fit an OK quant of 35B-A3B or 27B with some context; with 64GB you should be golden.
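To put rough numbers on the "bandwidth = generation speed" point: while generating, every active weight has to be read from memory once per token, so bandwidth divided by bytes read per token gives a ceiling on tokens per second. The sketch below assumes that simplification (it ignores KV cache reads, compute limits, and runtime overhead), and the 256 GB/s figure is a placeholder rather than a measurement of any particular machine.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/s if decoding is purely memory-bandwidth bound."""
    # Bytes of weights that must be streamed from memory for each token.
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 256.0  # placeholder value, swap in your own hardware's figure

# Dense 27B: all 27B parameters are read for every token.
dense = decode_ceiling_tok_s(BANDWIDTH_GB_S, active_params_b=27, bits_per_weight=4)
# 35B-A3B MoE: only about 3B parameters are active per token.
moe = decode_ceiling_tok_s(BANDWIDTH_GB_S, active_params_b=3, bits_per_weight=4)

print(f"27B dense @ 4-bit: {dense:.1f} tok/s ceiling")
print(f"35B-A3B   @ 4-bit: {moe:.1f} tok/s ceiling")
```

Real throughput will land below these ceilings, but the ratio shows why an MoE with few active parameters generates much faster than a dense model on the same memory bus.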

by sleepybrett, 2 hours ago

I have issues on an M5/64GB with 35B-A3B (MLX): it eventually hits a memory cap around 52GB... but I'm pretty happy with `Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-mlx-8Bit`

by c7b, 2 hours ago

I'm sure there will be a fix for it, but it illustrates an important broader point I should probably have made above: if you opt for local AI today, expect to run into some issues. Expect to learn a bit about the tools you're using, the not-so-fun way. I'm not recommending it to non-technical friends (yet).

by jadbox, 8 hours ago

A PC with an nvidia card with 16gb vram works just fine for Qwen MoE models, and these have worked great as a daily driver for me.

by SwellJoe, 1 hour ago

A 4-bit quantization of either Qwen 3.6 27b or Gemma 4 31b will run on a 32GB Mac with a decent-sized, but not full-sized, context. 64GB gets you the full ~256k context and you don't need to quantize your KV cache (though 8-bit quantization of KV may be worth it for performance). The 4-bit QAT version of Gemma 4 has practically identical performance to the full size version or the 8-bit version in most benchmarks and my tests, so there's no reason to run anything else. The 4-bit Qwen is a little bit lossy, as it hasn't gotten the QAT treatment, but not catastrophically lossy. A 6-bit dynamic quantization would be better for that model, but it's ~25GB on disk, and you'll need more than 32GB to run it with a big context.
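For anyone who wants to sanity-check sizes like these, here is a back-of-the-envelope sketch. Weights take roughly parameters × bits / 8 bytes, and the KV cache takes roughly 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The architecture numbers below are hypothetical placeholders, not the actual configuration of Qwen 3.6 or Gemma 4; `layers` here means only the layers that keep a full-length KV cache, which in models with sliding-window or hybrid attention is fewer than the total layer count.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bits_per_element: int = 16) -> float:
    """Approximate KV cache size in GB for one sequence."""
    # Factor of 2 covers keys and values.
    elements = 2 * layers * kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8 / 1e9


# Hypothetical architecture, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 16, 4, 256
CONTEXT = 256_000

w4 = weights_gb(27, 4)
kv16 = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
kv8 = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bits_per_element=8)

print(f"weights @ 4-bit:    {w4:.1f} GB")
print(f"KV cache @ 16-bit:  {kv16:.1f} GB")
print(f"KV cache @ 8-bit:   {kv8:.1f} GB")
print(f"total (16-bit KV):  {w4 + kv16:.1f} GB")
```

With these placeholder values the 4-bit weights come to 13.5 GB and the unquantized KV cache at 256k tokens to roughly 16.8 GB, which is the kind of total that is tight on a 32GB machine once the OS takes its share but comfortable on 64GB. Plug in the real values from a model's config to get a usable estimate.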

I wrote up how I run local LLMs, with numbers and a focus on running Qwen 3.6 and Gemma 4. I prefer Gemma 4 31b, even though the general consensus is that Qwen 3.6 is better for code, and it is better on most coding-focused benchmarks. It doesn't seem to be better for my use cases, though; Gemma feels smarter. And, with QAT, you get more smarts in less memory, so it's fast and runs on more hardware.

https://swelljoe.com/post/how-i-run-local-llms/

Currently, the sweet spot for self-hosted models is either Qwen 3.6 or Gemma 4, and those top out at 31B (Gemma) and 35B (for Qwen, but you want the dense Qwen 3.6 27B if you can run it at a reasonable speed...the dense models are much smarter), so for now, a system with 64GB or 128GB is going to be running the same models. Going to a bigger model doesn't get you better performance because there aren't any better models that are a little bigger. I wish there was a ~70B or even ~120B MoE in the Qwen 3.6 or Gemma 4 families, as I've got a Strix Halo running a model that leaves a lot of memory on the table (and it's not very fast, to boot...an MoE would be faster, and hopefully smarter if it's a much bigger model, like double or triple sized).

In short, right now, 64GB is all you need for the best models you can self-host on anything short of five-figure machines, but I wouldn't buy any hardware right now if you can wait a while. Tokens from DeepSeek are so cheap, you can wait out the memory shortage and get access to models you could never host locally. And OpenRouter always has free models, either in preview or just because, that you can use lightly, as they're rate-limited (but your self-hosted models are going to be rate-limited, too, because a Mac Mini can't run models very fast). Google AI Studio has the Gemma 4 models for free too, also rate/usage limited.

by mathgeek, 7 hours ago

Good summary blog: https://maloyan.xyz/blog/running-qwen-locally-mac-mini-m4

by coredog64, 5 hours ago

> That's not hypothetical — it's a real measurement on the base model Mac Mini.

Hmmm

by blensor, 8 hours ago

I am curious if you implicitly assumed they are Macs or if that's what you are looking for specifically?

by JSR_FDED, 7 hours ago

I assumed the 27B dense model would be preferable to a MoE model, and that it wouldn’t fit into a consumer graphics card, which leaves the Macs.

Then I assumed for cost and battery/heat reasons that a Mini would be better than a laptop.

by SwellJoe, 53 minutes ago

The current dense models from the Gemma 4 or Qwen 3.6 families will run well on a consumer GPU with 32GB in a 4-bit quantization (which is a little lossy for Qwen 3.6, not so much for Gemma 4, as it has a QAT 4-bit version). Even an Intel Arc B70 will work, though it's worth spending a little more for the AMD Radeon AI Pro 9700, as it'll be like 40% faster, I think. A dedicated GPU will be faster and cheaper than a Mac Mini. But nothing is a good deal right now; everything is overpriced (except DeepSeek tokens, which cost pennies to run a model that's better than anything you could self-host...DeepSeek V4 Flash, and even Pro, are absurdly cheap, made even cheaper by their bonkers cheap cached token pricing and uniquely effective caching).

by blensor, 7 hours ago

The reason I was curious is that I am running my stuff on a Strix Halo, and I get the feeling that this class of devices (GMKtec, Minisforum, Lenovo, etc.) is becoming a pretty good alternative.

by c7b, 4 hours ago

Unified memory feels like the future of consumer hardware, agreed! Do check out r/StrixHalo

by blensor, 1 hour ago

Agreed, it was a bit of a pain to get running on my Ubuntu machine because I had old amdgpu-dkms-firmware packages installed without realizing it. But now that it's running it's amazing how well it works

by c7b, 1 hour ago

Sounds like you got it sorted, but more generally this may be interesting: https://github.com/kyuz0/amd-strix-halo-toolboxes

by adastra22, 4 hours ago

Strix Halo has better performance than a Mac Mini, but not as good as a Mac Studio. But the 128GB unified memory is awesome for larger models.

by mswphd, 3 hours ago

dense models are (more) compute heavy, so are generally worse to run on mac. mac tends to be better for (larger) MoE models.

27B dense can fit on a consumer graphics card. Even without getting into various "intrusive" ways to shrink the size of a model (e.g. REAP), something like a NVFP4 quant of Qwen3.6 27b

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4

should fit within ~22GB of VRAM. So easily on a 5090. It would also fit on a 3090/4090, but iirc they don't have NVFP4 natively, so you would want a different quant for them.
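A quick sketch of where a number in that range comes from, assuming NVFP4 stores 4-bit values plus one 8-bit scale per block of 16 values (so about 4.5 bits per weight on average). It also assumes every weight is quantized this way, which real checkpoints may not do, so treat the result as a floor for the weights alone; the rest of the ~22GB budget goes to KV cache, activations, and runtime overhead.

```python
def effective_bits(value_bits: int, scale_bits: int, block_size: int) -> float:
    """Average bits per weight for a block-scaled format."""
    # Each block of `block_size` values shares one scale of `scale_bits` bits.
    return value_bits + scale_bits / block_size


def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8


nvfp4_bits = effective_bits(value_bits=4, scale_bits=8, block_size=16)
weight_gb = weights_gb(27, nvfp4_bits)

for vram_gb in (22, 24, 32):
    headroom = vram_gb - weight_gb
    print(f"{vram_gb} GB budget: {weight_gb:.1f} GB weights, "
          f"{headroom:.1f} GB left for KV cache and overhead")
```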

you can see /r/LocalLLaMA for some discussions. See this (random) post about Qwen3.6-27B on a 3090 at ~100 tok/s

https://www.reddit.com/r/LocalLLaMA/comments/1ujo46r/qwen_36...

Note that it is possible you could still do this stuff with a mac, as there are ways of hooking up an eGPU to macs and using it for inference. My understanding is they're all fairly hacky though, so it would likely be preferable to just get a 3090 (or a non-nvidia option, e.g. an AMD r9700 pro has ~32GB of VRAM for much cheaper than a 5090).

https://www.reddit.com/r/LocalLLaMA/comments/1u50hnm/qwen_27...

that seems considerably slower though (~30 tok/s). I don't know if that's an outlier/misconfigured setup or what. In general there will be much better resources for local setups using 3090s, as they're quite popular. Note that 3090s (but not 4090s nor 5090s) have NVLink, so you can network the cards fairly effectively. For this reason 2x 3090 setups are fairly popular as well. I've heard that club 3090 makes that relatively straightforward

https://github.com/noonghunna/club-3090

but don't have experience myself.