Qwen 3.6 is a toy compared to DeepSeek V4 Flash or Pro. These models can now run on Apple Silicon hardware using SSD offloading, with as little as 32GB RAM for Flash (at a 2-bit quant, which is still quite capable). Performance is just about reasonable for interactive use, and far better on longer contexts than Qwen, thanks to DeepSeek's more efficient KV cache and attention mechanisms.

Much larger gains may be possible for unattended inference via large-scale batches, which can reuse sparse experts and thereby mask some of the latency involved. This is fairly unique to DeepSeek, again due to its efficient KV cache.

Qwen 3.6 27B still curb-stomps DeepSeek V4 in coding.

1. DeepSeek V4 is still in preview (training is not finished).

2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. All 27B parameters are active for every token. It's not an MoE architecture, where a router activates only some of them.

3. Qwen doesn't like quantization at all.
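To put rough numbers on point 2: single-stream decoding is usually limited by how many weight bytes have to be read per token, so a back-of-the-envelope ceiling is memory bandwidth divided by active weight bytes. A minimal sketch, with the assumptions flagged: the 1 TB/s bandwidth is a hypothetical round number, the 3B-active MoE is just a stand-in for an A3B-style model, and the estimate ignores KV cache reads, compute, and quantization overhead.

```python
def bytes_read_per_token(active_params: float, bits_per_weight: float) -> float:
    """Approximate weight bytes streamed from memory to decode one token."""
    return active_params * bits_per_weight / 8


def decode_ceiling_tok_s(active_params: float, bits_per_weight: float,
                         bandwidth_bytes_s: float) -> float:
    """Upper bound on tokens/sec if decoding is purely memory-bandwidth bound."""
    return bandwidth_bytes_s / bytes_read_per_token(active_params, bits_per_weight)


BANDWIDTH = 1.0e12  # ASSUMPTION: hypothetical 1 TB/s, not a measured figure

# Dense 27B: every parameter is touched for every token.
dense = decode_ceiling_tok_s(27e9, 5, BANDWIDTH)
# MoE with ~3B active parameters per token (hypothetical A3B-style model).
moe = decode_ceiling_tok_s(3e9, 5, BANDWIDTH)

print(f"dense 27B @ 5-bit: ~{dense:.0f} tok/s ceiling")
print(f"MoE 3B-active @ 5-bit: ~{moe:.0f} tok/s ceiling")
```

Under these assumptions the ceilings come out at roughly 59 vs 533 tok/s, i.e. the gap is just the ratio of active parameters. Real throughput will be lower in both cases.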

I have to disagree with most claims. I run Qwen3.6-27b at 260k context and 40-60 tok/sec. It handles most coding problems as well as Sonnet 4.6 under OpenCode on our production tasks. (As an experiment, I run the same prompts for the same issues in parallel for Qwen 3.6 and Sonnet 4.6 and usually see little difference in performance). I see zero degradation from quantization in practice.

Settings: RTX 5090, 5-bit weights (Unsloth), FP8 KV cache.
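For context on those settings, the weight footprint is roughly parameters times bits per weight, divided by 8. A quick sketch, assuming an idealized quant with no per-block scale overhead (real quant files run somewhat larger) and taking the 5090's 32GB of VRAM as the budget:

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Idealized weight size in decimal GB (ignores quantization scales/zero-points)."""
    return n_params * bits_per_weight / 8 / 1e9


VRAM_GB = 32      # RTX 5090
N_PARAMS = 27e9   # dense 27B model

for bits in (2, 4, 5, 8, 16):
    size = weight_gb(N_PARAMS, bits)
    # A negative remainder means the weights alone do not fit in VRAM.
    print(f"{bits:>2}-bit: {size:6.2f} GB weights, "
          f"{VRAM_GB - size:6.2f} GB left for KV cache and activations")
```

At 5 bits this gives about 16.9 GB of weights, leaving roughly 15 GB of headroom for the KV cache and everything else.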

Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.

A 260k context (close to the stock maximum for Qwen, though it's possible to extend it) will take ~16GB RAM for storing the KV cache, barring quantization tricks which severely degrade quality. That's a whole lot more than what DeepSeek requires for a similar context length, and it makes batching multiple inferences together infeasible. This used to be the status quo for consumer inference, and in fact it still is for models like Kimi and GLM (which can sometimes be smarter than even DeepSeek V4 Pro!), but we can do better nowadays.

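The ~16GB figure is easy to sanity-check with the usual formula for a standard (GQA-style) KV cache: 2 (K and V) times attention layers, KV heads, head dimension, tokens, and bytes per element. The sketch below uses placeholder architecture numbers that I picked only because they land near that figure; they are not taken from a published Qwen config, and models with compressed or hybrid attention won't follow this formula.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """KV cache size for one sequence with standard (GQA-style) attention."""
    # Factor 2: one K tensor and one V tensor are cached per attention layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# ASSUMPTION: hypothetical placeholder config, chosen for illustration only.
N_ATTN_LAYERS = 16
N_KV_HEADS = 4
HEAD_DIM = 256
SEQ_LEN = 260_000

fp16 = kv_cache_bytes(N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, 2)
fp8 = kv_cache_bytes(N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, 1)

print(f"FP16 KV cache: {fp16 / 1e9:.2f} GB ({fp16 / 2**30:.2f} GiB)")
print(f"FP8  KV cache: {fp8 / 1e9:.2f} GB ({fp8 / 2**30:.2f} GiB)")
```

With these placeholders the cache is about 17 GB at FP16 and half that at FP8. The cost also scales linearly with the number of sequences in a batch, which is why batching gets expensive quickly at this context length.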
You can run the 35B A3B model, which is an MoE. Runs great on a 5090.

I've got a Qwen 3.5 running on a 12GB 3060 and it's dumb as a stump but still smart enough to get some useful work done. Since it's my daily-driver desktop, I haven't jumped to 3.6: last time I did, I quickly ran out of VRAM and locked up the desktop environment.

But yeah, the Qwen line is pretty impressive on commodity hardware.

I must be using LLMs very differently than y'all, because I can't think of a single thing I would rely on an LLM that's "dumb as a stump" to do for me.

To me, LLMs are for asking research questions + exploring design spaces + pointing at codebases to investigate bugs. And those all benefit from the model being as "smart" (in terms of both fluid intelligence and burned-in knowledge) as possible.

I'm guessing there exist problems where "intelligence past a certain point" doesn't matter, so these medium-sized models can match the performance of the bigger models. But what problems might those be?

Qwen suffers a lot from quantization, rendering it borderline unusable.