The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat, you can pull the `35b-a3b-nvfp4` model instead (it doesn't need to re-download the weights, so it will pull quickly), which has a presence penalty enabled that stops it from thinking so much. You can also try `/set nothink` in the CLI, which turns off thinking entirely.
reply
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.

reply
Indeed. Qwen doesn’t just second-guess itself, it third- and fourth-guesses itself.
reply
Solid Terry Pratchett reference right there.
reply
OK thanks! That's helpful. I ignorantly assumed simpler prompt == faster first response.
reply
I did not know that NVFP4 was handled at the silicon level... until I dug deeper here - https://vectree.io/c/llm-quantization-from-weights-to-bits-g...
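
For intuition, here's a minimal Python sketch of what block-scaled FP4 quantization looks like. This is a simplification for illustration, not Ollama's or NVIDIA's actual code: real NVFP4 packs two 4-bit codes per byte and stores an FP8 scale per small block, which is what newer NVIDIA GPUs can decode in hardware.

```python
# The 8 non-negative values representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, grid=E2M1_GRID):
    """Quantize one block of floats: pick a shared scale, then snap each
    value to the nearest representable FP4 magnitude."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / grid[-1]  # map the largest magnitude onto +/-6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        q = min(grid, key=lambda g: abs(g - target))
        codes.append(-q if x < 0 else q)
    return scale, codes

def dequantize_block(scale, codes):
    # Dequantization is just a multiply, which is why hardware support helps.
    return [c * scale for c in codes]

weights = [0.31, -0.07, 1.2, -0.9]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)  # roughly [0.3, -0.1, 1.2, -0.8]
```

The point is that each weight costs only 4 bits plus a small shared scale, at the price of the rounding error you can see in `restored`.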
reply
I still don't think I understand it. I saw those nvfp4 models up by chance yesterday and tried them on my Linux PC with a 5060 Ti 16GB. Ollama refused to pull them, saying they were macOS only.

I assumed it was a meta-data bug and posted an issue, but apparently nvfp4 doesn't necessarily mean nvidia-fp4.

https://github.com/ollama/ollama/issues/15149

reply
They are nvidia-fp4 weights, but CUDA support isn't _quite_ ready yet; we've got that cooking.
reply
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).

reply
Avoid reasoning models in any situation where you have low tokens/second
reply
When MLX support comes out you will see a huge difference. For now I've moved to LMStudio, as it already supports MLX.
reply
I made my M2 Max (64GB of RAM) generate a biryani recipe for me last night with the baseline qwen3.5:35b model. I used the newest ollama with MLX.

https://gist.github.com/kylehotchkiss/8f28e6c75f22a56e8d2d31...

Under 3 minutes to get all that. The thinking is amusing and my laptop got quite warm, but for a 35b model on nearly four-year-old hardware, I see the light. This is the future.

reply
Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions.

Second, for the best performance on a Mac you want to use an MLX model.

reply
Thanks! I assumed simpler == faster, but my ignorance is showing itself.

I am using the model they recommended in the blog post - which I assumed was using MLX?

reply