The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat, you can pull the `35b-a3b-nvfp4` model instead (it doesn't need to re-download the weights, so it will pull quickly), which has a presence penalty enabled that stops it from thinking so much. You can also try `/set nothink` in the CLI, which turns off thinking entirely.
reply
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.

reply
Indeed. Qwen doesn’t just second-guess itself, it third- and fourth-guesses itself.
reply
Solid Terry Pratchett reference right there.
reply
OK thanks! That's helpful. I ignorantly assumed simpler prompt == faster first response.
reply
I did not know that NVFP4 was handled at the silicon level... until I dug deeper here - https://vectree.io/c/llm-quantization-from-weights-to-bits-g...
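
For intuition, here's a minimal Python sketch of what block-scaled FP4 quantization looks like. This is a simplification for illustration, not Ollama's or NVIDIA's actual code: real NVFP4 packs two 4-bit codes per byte and stores an FP8 scale per small block, which is what newer NVIDIA GPUs can decode in hardware.

```python
# The 8 non-negative values representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, grid=E2M1_GRID):
    """Quantize one block of floats: pick a shared scale, then snap each
    value to the nearest representable FP4 magnitude."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / grid[-1]  # map the largest magnitude onto +/-6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        q = min(grid, key=lambda g: abs(g - target))
        codes.append(-q if x < 0 else q)
    return scale, codes

def dequantize_block(scale, codes):
    # Dequantization is just a multiply, which is why hardware support helps.
    return [c * scale for c in codes]

weights = [0.31, -0.07, 1.2, -0.9]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)  # roughly [0.3, -0.1, 1.2, -0.8]
```

The point is that each weight costs only 4 bits plus a small shared scale, at the price of the rounding error you can see in `restored`.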
reply
I still don't think I understand it. I saw those nvfp4 models up by chance yesterday and tried them on my Linux PC with a 5060 Ti 16GB. Ollama refused to pull them, saying they were macOS only.

I assumed it was a meta-data bug and posted an issue, but apparently nvfp4 doesn't necessarily mean nvidia-fp4.

https://github.com/ollama/ollama/issues/15149

reply
They are nvidia-fp4 weights, but CUDA support isn't _quite_ ready yet; we've got that cooking.
reply
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).

reply
Avoid reasoning models in any situation where you have low tokens/second
reply
When MLX support comes out you will see a huge difference. For now I've moved to LMStudio, as it already supports MLX.
reply
I made my M2 Max (64GB of RAM) generate a biryani recipe for me last night with the baseline qwen3.5:35b model. I used the newest ollama with MLX.

https://gist.github.com/kylehotchkiss/8f28e6c75f22a56e8d2d31...

Under 3 minutes to get all that. The thinking is amusing and my laptop got quite warm, but for a 35b model on nearly four-year-old hardware, I see the light. This is the future.

reply
Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions.

Second, for the best performance on a Mac you want to use an MLX model.

reply
Thanks! I assumed simpler == faster, but my ignorance is showing itself.

I am using the model they recommended in the blog post - which I assumed was using MLX?

reply