by Catloafdev 1 day ago

by bitexploder 1 day ago

For an MBP, I have an M5 Pro with 48 GB of RAM. It runs at about 12-14 t/s at Q4; you could probably optimize it further. RAM capacity is not the limitation; overall memory bandwidth is. Q8 is slower. Qwen 35B A3B is quite speedy, but a little less accurate. Alongside Qwen 3.6 27B dense I can squeeze in a 9B parameter model and use that for fast analysis or code scanning while the 27B is churning on a task in the background. It is tight, but totally reasonable.
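To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes decoding a dense model reads every weight once per generated token, so tokens/s cannot exceed memory bandwidth divided by the size of the weights. The function name and the 4.5 bits/weight figure for a Q4-class quant are assumptions for illustration; 8.5 bits/weight matches llama.cpp's Q8_0 layout, and 936 GB/s is the RTX 3090's spec bandwidth.

```python
def decode_tps_ceiling(params_billion: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode tokens/s for a dense model.

    Assumes every weight is read from memory once per generated token and
    ignores KV-cache reads, compute time, and kernel overhead.
    """
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params * bytes each = GB
    return bandwidth_gb_s / weights_gb


# 936 GB/s is the RTX 3090 spec bandwidth; 4.5 bits/weight is an assumed
# average for a Q4-class quant, not a measured file size.
q4 = decode_tps_ceiling(27, 4.5, 936)
q8 = decode_tps_ceiling(27, 8.5, 936)
print(f"27B dense, ~Q4: at most ~{q4:.0f} tok/s")
print(f"27B dense, ~Q8: at most ~{q8:.0f} tok/s")
```

Real numbers land below this ceiling, but the shape matches what is described here: roughly doubling the bytes per weight roughly halves the achievable speed, regardless of how much RAM is free. Plug in your own machine's bandwidth to get its ceiling.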

The real sweet spot for Qwen 27B is getting it on something like a dual 3090 system or some other config where it can blaze at 50-80 t/s, and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, and task farming, and then letting Qwen churn, is relatively inexpensive.

Overall, I recommend people try models of this class out using OpenCode and some for-pay service to experiment with them and understand how they work. I find they are very useful.

Long term, I am convinced enough that if I wanted to use local models, for any number of reasons, I would be okay investing in a dual GPU box. The Mac is not fast enough for me and the M5 Max is just too expensive relative to a GPU Linux box. Still, it is nice to have the models local ON the laptop, and it is useful for what I care about locally.

by aunty_helen 1 day ago

I was doing some benchmarking last night on two 3090s. The system's a bit old, but I'm seeing 11 t/s on the 27B and 15 t/s on the 35B MoE.

The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.

It does seem to be doing useful work, but it's not API-call-level quality.

by coder543 21 hours ago

> The system's a bit old, but I'm seeing 11 t/s on the 27B and 15 t/s on the 35B MoE

If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.

With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.6-27B.)
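As a rough sanity check on those two figures, the sketch below compares the measured speedup of the MoE model over the dense one with the naive expectation from active parameter counts. It assumes that "A3B" means roughly 3B parameters are active per token and that decode speed scales inversely with parameters read per token; both are simplifications, and the variable names are just for illustration.

```python
# Numbers reported in the comment above (one RTX 3090, llama-bench, no MTP).
dense_tps = 41.0   # Qwen3.6-27B
moe_tps = 153.0    # Qwen3.6-35B-A3B

# Assumption: decode cost scales with parameters read per token, and "A3B"
# means roughly 3B active parameters per token.
dense_active_b = 27.0
moe_active_b = 3.0

ideal_speedup = dense_active_b / moe_active_b   # if only weight reads mattered
measured_speedup = moe_tps / dense_tps          # what was actually observed
print(f"ideal {ideal_speedup:.1f}x, measured {measured_speedup:.1f}x")
```

The measured gap is smaller than the naive ratio, which is what you would expect once attention, routing, and fixed per-token overhead are counted, but it is the same order of magnitude. Results where the MoE is only slightly faster than the dense model suggest a bottleneck somewhere other than weight reads.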

I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.

by aunty_helen 5 hours ago

Good to know. Might be worth updating the motherboard then; it's limited in PCIe speed.

by coder543 21 hours ago

> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4

Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.

by bitexploder 20 hours ago

Nope. MLX in LMStudio. The simplest config with zero tuning effort.

by coder543 20 hours ago

Unsloth Studio is also very low effort, and a lot better than LM Studio in my opinion. (Performance, compatibility with Gemma 4, actually open source, etc.)

by CMay 1 day ago

At 24GB, Gemma 4 31B QAT will be better and give more concise answers. This post is mostly about unquantized results, so it's less relevant, and I can't say much about them as I haven't tested Qwen or Gemma via cloud API or unquantized locally. All I can say is that locally, quantized in a 24GB scenario, Gemma 4 31B is better in my tests, which are mostly reasoning or C programming related.

Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.

When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.

Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.

Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.

by thewebguyd 1 day ago

I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.

If you want to run unquantized, you definitely need 128GB.
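The arithmetic behind sizing claims like these can be sketched as follows. This is a minimal estimate that counts the weights only and assumes a 27B-parameter model (the comment does not name the model it is sizing, so that figure is an assumption). Real quantized files carry extra scale data, and the KV cache, activations, and the OS all need memory on top. The function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def headroom_gb(memory_gb: float, params_billion: float,
                bits_per_weight: float) -> float:
    """Memory left over for KV cache, activations, and runtime overhead."""
    return memory_gb - weights_gb(params_billion, bits_per_weight)


# Assumption: 27B parameters; nominal bit widths, ignoring quantization metadata.
for label, bits in [("4-bit", 4), ("8-bit", 8), ("16-bit", 16)]:
    size = weights_gb(27, bits)
    print(f"{label}: ~{size:.1f} GB weights, "
          f"{headroom_gb(24, 27, bits):+.1f} GB left of 24 GB, "
          f"{headroom_gb(32, 27, bits):+.1f} GB left of 32 GB")
```

A negative headroom value means the weights alone do not fit. Whatever positive headroom remains has to cover the context, which is why long contexts get tight well before the weights stop fitting.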

by Catloafdev 1 day ago

Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.

by bityard 23 hours ago

Halving the precision of the weights is not a free lunch...

by Catloafdev 21 hours ago

Q8 is virtually lossless. The quantization is much more noticeable around Q4 and below. FP16->Q8 on consumer hardware is 2x the speed at ~99.99% the quality.

by rvba 12 hours ago

Any source that confirms the 99.99% quality?

by bitexploder 1 day ago

It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.

by Numerlor 1 day ago

And if you go for actual GPUs it'll run much faster. I'd say 24 GB may be pushing it for context, but my 5090 with 32 GB of VRAM is usually somewhere between 60 and 100 tok/s with MTP, and 2-3k tok/s for prompt processing. I'm not sure what they cost now, but it's definitely still quite far from the MacBook, and there are also some other 32 GB GPUs that are considerably more affordable.

by nok22kon 1 day ago

A computer with 24 GB of VRAM is at least $3000.

by sleepyeldrazi 1 day ago

I can't speak for the US, but in Germany (where hardware is usually more expensive, not less), I got my 3090 three months ago for 750 euro and have been running the iq4_nl 27B using q4 KV (which, after recent patches in llama.cpp, is in my experience indistinguishable in accuracy from q8 or f16) at full ctx, with MTP at 2, peaking around 70 t/s on small ctx, around 50 t/s when I'm around 64k, and ending around 40 t/s near the cap. The rest of the PC is a 50 euro DDR3 16 GB 4th-gen i5 box, absolutely nothing special. And this setup is often more useful than dsv4pro (and sometimes kimi, but not glm) for research and ML work.

by danilocesar 1 day ago

I can't find a 3090 for less than 2k CAD (or 1200 EUR). Is this the average price in Germany? It's pretty cheap.

by sleepyeldrazi 11 hours ago

I got it off Kleinanzeigen; it's an eBay-like site (but mostly 'pick it up yourself' instead of delivery). Looking at it right now, I do see multiple listings for 850-900. I did spot the 750 one after frequenting the site for a week or two, so it may be a bit of a 'better than average' deal, and it seems most are in the 1k euro range, but there are a handful available under that.

As of writing this, it shows 24 offers between 700 and 950.

by akman 1 day ago

I'm also curious, as this could pay for a trip out there, especially if buying for friends.

by daemonologist 1 day ago

A 7900 XTX is about $850, and the rest of the computer basically just needs to boot Linux. You could easily build such a machine for $1500.

Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.
