The limited context is problematic. I’m not exactly sure how much it has available, but Hermes was hit and miss on a prospecting job.
It does seem to be doing useful work, but the quality isn’t at the level you get from API calls.
If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.
With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.6-27B.)
I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.
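For a rough sense of why bf16 would be a problem here, this is a back-of-the-envelope sketch of the weight footprint for a 27B-parameter model. It counts weights only (no KV cache, no runtime overhead), treats 48GB as 48 GiB, and the 4.8 bits per weight figure for a Q4-class quant is my assumption, not a measured number for UD-Q4_K_XL.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


PARAMS_B = 27.0   # parameter count in billions (the 27B model)
VRAM_GIB = 48.0   # two 24 GB cards, treated as 48 GiB
Q4_BITS = 4.8     # ASSUMPTION: rough average bits/weight for a Q4-class quant

for label, bits in (("bf16", 16.0), ("Q4-class (assumed)", Q4_BITS)):
    size = weight_gib(PARAMS_B, bits)
    # A negative "left over" value means the weights alone do not fit.
    print(f"{label}: {size:.1f} GiB weights, {VRAM_GIB - size:.1f} GiB left over")
```

Under those assumptions, bf16 weights alone come to roughly 50 GiB, which is already over 48 before any context is allocated, while a Q4-class quant leaves most of the VRAM free for KV cache.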
Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.