upvote
Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.

I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.

As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.

Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.

reply
Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.
reply
This is the default value!
reply
Llama.cpp defaults to 1.0 (disabled) and so does vLLM. It looks like only ollama defaults to 1.1.
reply
deleted
reply
There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).
reply
I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.
reply
`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant
reply