I was doing some benchmarking last night on two 3090s. The system's old, but I’m seeing 11 tok/s on the 27B and 15 tok/s on the 35B MoE.

The limited context is problematic. I’m not exactly sure how much it’s got available, but Hermes was hit and miss on a prospecting job.

It does seem to be doing useful work, but it’s not API-call-level quality.

reply
> The system's old, but I’m seeing 11 tok/s on the 27B and 15 tok/s on the 35B MoE

If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.

With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.6-27B.)

I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.
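
Rough back-of-the-envelope sketch of why bf16 would be a problem here. The assumptions are mine: about 4.8 bits per weight as a ballpark average for a Q4_K-style quant (not a measured figure for UD-Q4_K_XL), 936 GB/s as the RTX 3090's spec memory bandwidth, and a dense model that reads every weight once per generated token. KV cache and activations are ignored, so real numbers will come in lower.

```python
# Back-of-the-envelope estimate only; every constant below is an assumption.
BF16_BITS = 16.0
Q4_BITS = 4.8               # assumed average bits/weight for a Q4_K-style quant
RTX_3090_BW_GB_S = 936.0    # spec memory bandwidth of one RTX 3090
TOTAL_VRAM_GB = 48.0        # two 24 GB cards


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (decimal) for a dense model."""
    return params_billion * bits_per_weight / 8.0


def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for label, bits in (("bf16", BF16_BITS), ("~Q4", Q4_BITS)):
        size = weight_gb(27, bits)
        ceiling = decode_ceiling_tok_s(size, RTX_3090_BW_GB_S)
        fits = "fits" if size < TOTAL_VRAM_GB else "does not fit"
        print(f"27B {label}: ~{size:.1f} GB of weights "
              f"({fits} in {TOTAL_VRAM_GB:.0f} GB), "
              f"ceiling ~{ceiling:.0f} tok/s")
```

Under those assumptions, the bf16 weights alone are bigger than 48 GB, and even the theoretical ceiling lands in the teens of tok/s, which is in the same neighborhood as the 11 tok/s you reported. Layer-splitting across two cards doesn't raise that ceiling much, since the cards mostly take turns rather than working in parallel.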

reply
Good to know. Might be worth upgrading the motherboard then; it’s limited in PCIe speed.
reply
> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4

Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.

reply
Nope. MLX in LM Studio. The simplest config with zero tuning effort.
reply
Unsloth Studio is also very low effort, and a lot better than LM Studio in my opinion. (Performance, compatibility with Gemma 4, actually open source, etc.)
reply