undefined

upvote

points

by briga9 hours ago |

upvote

by pixelesque7 hours ago|

[-]

Out of interest, what machine and model are you running it on?

I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.

What sort of speed should I be expecting?

I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.

Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)

I'm not expecting it to be instant, but what I'm currently seeing is not really usable.

reply

upvote

by gcr7 hours ago|

[-]

There are two flavors of Qwen 3.6:

- A 27B "dense" model

- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max with 32GB VRAM from 2021 that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. The 27B model can read ~70tok/sec and write ~5tok/sec.

The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work. It's not quite as "smart", but it's comparable.

reply

upvote

by flockonus4 hours ago|

[-]

For coding tasks 27B is reported to be much more effective, altho you can probably only run 4b or 5b quants @ this memory.

Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.

reply

upvote

by pixelesque6 hours ago|

[-]

Thank you - I'll give that a go!

reply

upvote

by julianlam6 hours ago|

[-]

May I ask why the M instead of XL?

Obviously bigger != better but I don't know what the differences are.

reply

upvote

by DiabloD33 hours ago|

[-]

These are dynamic quants, and they're basically just an indication of how far away from the desired quant it is allowed to go to achieve the goal. Generally, unsloth's toolchain moves quants up, rarely down.

* _0 and _1 do not use K quant and scales 32x32 blocks according to the original (B)F16 values; _0 scales the block using the original max and min values. _1 does this per row instead of per block.

* K quants do something similar, but now splits blocks into subblocks inside a superblock where the superblock has min/max scaling, but the subblocks also have scaling in the range of the superblock's scaling and are stored using less bits.

* K's M, L, XL are just how aggressively the subblocks and their scaling factors are chosen. Generally, it puts a max on how far you can deviate from the chosen quant to maintain the desired quality, but also gives them a bigger budget to perform that excursion in. XL most aggressively tries to preserve the intended quality, while S does the least.

* Dynamic quant on top of this scales entire layers, full of blocks, according to how much they effect various measurements (such as KLD and perplexity).

That said, there is no reason K_S is even produced by anyone, same with Q_0, Q_1, and I_NL. People should no longer be using those. M only is meaningful if you're trying to restrict the upper bounds: K_XL can reach BF16 for some weights, but rarely; people think this has a speed implication for hardware that has native 8bit in their tensor units (but it doesn't).

Unless you're specifically trying to cure a problem, stick with K_XL.

reply

upvote

by rao-v1 hours ago|

[-]

Hey some of us are on hardware (gfx906 based Radeon MI50s with 32GB of stupidly fast VRAM and basically no compute) that inference significantly faster with Q_0 and Q_1 quants

reply

upvote

by srcrip1 hours ago|

[-]

You seem to understand this stuff pretty well, any recommendations on resources (blogs, YouTube channels, whatever) for software engineers that want to keep up with this stuff on this kind of level?

A lot of the content about AI out there is kind of produced to the lowest common denominator. Basically a never ending scheme of get rich quick/passive income kinds of AI content.

reply

upvote

by DiabloD34 hours ago|

[-]

I recommend sticking with the dense models for both Qwen and Gemma.

On testing I've done on same-quant apples to apples, with F16/F16 (ie, unquantized) kv cache, 35B-A3B underperforms against 27B on anything even remotely complex. But yes, 35B-A3B can be like 3-4x faster on my hardware.

By Qwen's own admission, on any meaningful benchmark (ie, ones that involve logic, math, or tool calling), 27B performs like 122B-10B and 397B-A17B, but 35B-A3B is somewhere between 27B dense and 9B dense.

Also, MTP recently got merged in, so I'd suggest downloading Qwen 3.6 MTP (I assume you get it from unsloth) and updating your copy of llama.cpp, and adding `--spec-type draft-mtp --spec-draft-n-max 2` to your arguments.

https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/ https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/

Also, I recommend not quantizing kv cache, and if you do, only quantize v. Lowering model quant while also lowering context size to fit F16/F16 or F16/Q8_0 massively improves model performance for thinking models. Also, quantizing cache, either k or v, decreases speed by a lot on some hardware.

I have a 24gb 7900xtx, so I can fit >32k F16/F16 context with Qwen3.6-27B, but use unsloth's Q3_K_XL. This performs better than Q(4,5,6)_K_XL with v quantized.

Edit: Oh, and since I mentioned Gemma 4, my testing mirrors my Qwen 3.5/3.6 experiences, 26B-A4B performs worse than 31B, but is also way faster. llama.cpp doesn't support Gemma 4's MTP style yet, so both could get even faster.

reply

upvote

by booty5 hours ago|

[-]

    I tried the qwen3.6-27b Q6_k GUFF in llama.cpp 
    and LM Studio on my M2 MacBook Pro 32GB machine 
    last week, and I barely get a token a second with either.

The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance.

(not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot)

Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.

reply

upvote

by mark_l_watson4 hours ago|

[-]

You are using Q6 6 bit quantization; on my 32G MacMini I use Q4 and it is faster but when I use it with OpenCode, I set up a task and go outside to walk for ten minutes. Smart, capable, and slow. Still, I love using local models.

EDIT: I run with context wired at 64K

reply

upvote

by satvikpendem5 hours ago|

[-]

Check out Unsloth Studio it provides MTP support now which 2x the token generation speed with no loss of accuracy: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

reply

upvote

by mft_7 hours ago|

[-]

The 27B model is dense, so is relatively slow. The 35B-A3B model is marginally weaker but being MoE is much faster - like ~4-8x faster in basic benchmarks on my M1 Max.

For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:

Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).

Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.

reply

upvote

by stebalien2 hours ago|

[-]

Have you tried enabling MTP? Those numbers are similar to what I was getting on my Strix Halo box, but configuring/enabling MTP doubled the TG speed of the 27B model (18-20 t/s now).

reply

upvote

by pixelesque6 hours ago|

[-]

Thanks for the info.

reply

upvote

by Figs7 hours ago|

[-]

27B is the dense one. Try the Qwen3.6-35B-A3B variants for the MoE release. That's what I'm running on a Framework Desktop and I get ~50 tok/s plus or minus a few. The dense one is similarly slow for me -- not sure what to expect on your hardware from the MoE but it should probably be much faster.

reply

upvote

by pixelesque6 hours ago|

[-]

Thanks!

reply

upvote

by 1272 hours ago|

[-]

I get 150t/s peak, 120t/s avg with Qwen3.6 27B Q4 with a 4090 on Linux. Now that MTP has landed into llama.cpp.

reply

upvote

by KronisLV7 hours ago|

[-]

> qwen3.6-27b Q6_k

That's the dense model, you probably want a mixture-of-experts (MoE) one.

Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

reply

upvote

by pixelesque6 hours ago|

[-]

Thanks!

reply

upvote

by dzr00015 hours ago|

[-]

My token throughput is much better using vLLM-mlx on my M2 ultra than llama.cpp. It might be worth a shot to give it a try.

reply

upvote

by electroglyph1 hours ago|

[-]

you should be using dflash with that model, look it up

reply

upvote

by plufz9 hours ago|

[-]

Which exact model are you using? And with which parameters and quant? And on what hardware? Are you using any specific MCPs or other tools to optimize performance like context-mode or dynamic context pruning? I’ve used local models a reasonable amount before but I’m just starting out with opencode. Haven’t had great results yet but really want this to work for simpler tasks. My opencode newly installed is also having iterm on 100% cpu in idle. :/

reply

upvote

by briga9 hours ago|

[-]

I'm running Qwen3.6:27b Q4 KM on a 4090 and similarly fast CPU and I think 32GB of RAM. Make sure the context window is set to be big enough otherwise the conversation will keep compacting. No special MCP tools set up yet. Qwen is able to do web search out-of-the-box although I think it is getting blocked by anti-bot firewalls--I still need to figure out if I can fix that.

reply

upvote

by SeriousM6 hours ago|

[-]

This is the repo: https://huggingface.co/pbhappliedsystems/qwen3.6-27B-gguf-Q4...

reply

upvote

by gcr7 hours ago|

[-]

here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.

  /Users/gcr/llama.cpp/build/bin/llama-server
      -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
      --no-mmproj-offload
      --fit on
      -c 65536 # edit to taste
      --reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
      --sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
      -ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.

I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.

For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.

You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).

Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc

Take backups and then go have fun. Hope this helps.

reply

upvote

by srcrip1 hours ago|

[-]

Can you elaborate more on the differences in running ollama or lmstudio? Do they actually slow down the speed of the inference and if so why? Or is it just a preference thing?

reply

upvote

by chr15m52 minutes ago|

[-]

This new version is not something you'll be able to run locally. It's a "cloud" model and likely too beefy if they do release the weights.

reply

upvote

by leonidasv9 hours ago|

[-]

Qwen Max are usually closed, unfortunately.

reply

upvote

by mostafab3 hours ago|

[-]

That's a signal of being SOTA.

reply

upvote

by wuliwong5 hours ago|

[-]

Do you have a feel for how it Qwen 3.6 compares to Sonnet 4.6? B/C in reality, that's what we use a lot. If we just use Opus 4.7 for everything code related, we'd have a monthly bill 10-20 times higher than using Sonnet where we can.

reply

upvote

by briga4 hours ago|

[-]

I would say if Sonnet is a senior engineer, then Qwen3.6 (the 27b model) is probably closer to a junior engineer. Still capable of getting stuff done, just needs more guidance and makes mistakes more often.

Maybe that's underselling it. It is quite a good model and might end up replacing a lot of the work I was sending to Sonnet 4.6.

Also, Sonnet 4.6 is almost certain a much bigger model so the performance differences aren't unexpected.

reply

upvote

by ecshafer8 hours ago|

[-]

Qwen3.6 with claude code works great. I get a lot better results with that than opencode and qwen3.6. Claude Code is a great harness, and good harness/tool integration makes a big difference. You just have a settings.json with your ollama setup and the qwen model and you can use it.

reply

upvote

by growt5 hours ago|

[-]

Where and how do you run that? I tried it but somehow I always ran out of context or generation was incredibly slow (mbp m4 pro 48gb).

reply

upvote

by kolinko6 hours ago|

[-]

As Opus maximalist ;) I was very surprised by the quality if Qwen3.6-27B - trying to figure out how to get it going on RTX 90k now to offload some lighter tasks :)

reply

upvote

by aembleton4 hours ago|

[-]

> Today we introduce Qwen3.7-Max, our latest proprietary model

This is not an open model

reply

upvote

by ttoinou5 hours ago|

[-]

Which agentic coding tool and how do you make sure you have prefix consistency ?

reply

upvote

by wouldbecouldbe7 hours ago|

[-]

This one doesnt seem to be open source though sadly. Using chinese servers is a step to far for me personally

reply

upvote

by gcr7 hours ago|

[-]

Look for an open release from the Qwen team in the coming weeks. They like to showcase their proprietary models first, which score higher on benchmarks anyway due to model size.

reply

upvote

by par7 hours ago|

[-]

Do you have an opinion on OpenCode vs Aider?

reply

upvote

by briga4 hours ago|

[-]

I haven't tried Aider yet but perhaps I will. Another one that seems to be getting traction is Pi Coding Agent.

reply

upvote

by sunaookami3 hours ago|

[-]

Aider is still around? That is pre-tool-calling era stuff. Better compare against Pi.

reply

upvote

by par1 hours ago|

[-]

I just started running coding agents locally. So you recommend Pi over opencode? (And obviously aider is out?)

reply