by dofm 1 day ago

by simonw 1 day ago

I've been trying Ornith 1.0 35B, I'm pretty impressed with it: https://simonwillison.net/2026/Jun/29/ornith/

by dofm 1 day ago

It's the one I have loaded right now.

It got rather tangled up when I tried it with one of my coding tests, which is a simple WordPress plugin. I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is a sort of hybrid of a greenfield and a brownfield task; a bit muddy.

It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.

TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.

by jensC 1 day ago

It is also available with Ollama now, and I am equally impressed.

by rhgraysonii 1 day ago

Thanks! I was thinking of getting the 128GB for some future-proofing. I figure at this point that having this sort of homelab and exposing it for your own uses is akin to a mechanic keeping great tools around. It is also great practice for building the next era of user-facing computing that will be around as this proliferates.

by dofm 1 day ago

I would not buy a 64GB model again, probably, if this were to remain particularly important to me. But I gather memory bandwidth is pretty important here.

So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
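To put rough numbers on the bandwidth point, here is a minimal sketch of the usual rule of thumb that token generation is bound by memory bandwidth: each generated token needs roughly one read of every active weight. The bandwidth figures are Apple's published peak numbers (400 GB/s for the M1 Max, 200 GB/s for the M2 Pro). The 27B dense model and the 0.5 bytes per parameter (about 4-bit quantisation) are illustrative assumptions, and the function name is made up for this example. It ignores the KV cache, compute limits and runtime overhead, so treat the result as a ceiling, not a prediction.

```python
def decode_upper_bound_tok_s(bandwidth_gb_s: float,
                             active_params_b: float,
                             bytes_per_param: float) -> float:
    """Rough ceiling on decode speed in tokens per second.

    Assumes every active weight is read from memory once per generated
    token and that nothing else (KV cache, compute) is a bottleneck.
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Apple's published peak memory bandwidth, in GB/s.
CHIPS = {"M1 Max": 400.0, "M2 Pro": 200.0}

# Assumption: a 27B dense model at about 4 bits (0.5 bytes) per parameter.
for chip, bandwidth in CHIPS.items():
    ceiling = decode_upper_bound_tok_s(bandwidth, 27.0, 0.5)
    print(f"{chip}: at most ~{ceiling:.1f} tok/s")
```

Under these assumptions the M1 Max ceiling is exactly twice the M2 Pro ceiling, because only the bandwidth term differs. Real throughput will land below either number.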

There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.

The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.

(And I must reiterate that my understanding of this stuff is pretty naïve.)

by freehorse 1 day ago

A used M1 Max is still a good choice because its memory bandwidth was only surpassed by the M4 generation and later (except for the Ultra variants, which are more expensive). Its prefill speed is not great though, which is an issue for running larger contexts; prefill only substantially improved with the M5. Moreover, machines up to the M3 only have Thunderbolt 4, not 5, which means they lack the RDMA support that would make stacking machines more effective. So unless you pay more for an M4 Max or later, or any Ultra, the M1 Max is still pretty decent compared to the M2 Max and M3 Max, and definitely better than the Pro variants, if you can find one at a decent price and want to experiment without caring much about time to first token and large contexts.

A very useful resource for characteristics and comparative performance of all M variants, if anybody is interested, is https://github.com/ggml-org/llama.cpp/discussions/4167?sort=...

Its sister discussion for NVIDIA GPUs is https://github.com/ggml-org/llama.cpp/discussions/15013

Note the drop in performance for the base (binned) M3 Max. You are better off with a full M1 Max than the binned M3 Max, even leaving price aside.

The issue I have with my M1 Max is that with 64GB you cannot run really decent MoE models: the ones you can run, like Qwen 35B-A3B, have only 3B active parameters and are much less capable than Qwen 27B in my testing. So I end up running the 27B one, but it runs relatively slowly (though still usable at 10-20 tok/s), and I would have been better off with a used NVIDIA GPU setup for dense models. I assume 35B-A3B has its use cases, e.g. as subagents; I just cannot find them. With more RAM I could probably run bigger MoE models that would be more comparable, though prefill would still be an issue (and probably a bigger one). The one hopeful thing is that performance hacks are appearing (speculative decoding and prefill) that seem to improve inference speed once they get implemented, so I am mildly hopeful.

(I must also reiterate that my understanding is not very deep either.)
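A minimal sketch of the dense versus MoE trade-off described above: the RAM needed for the weights scales with total parameters, while the data read per generated token scales with active parameters. The two parameter counts come from the comment itself. The 0.5 bytes per parameter (about 4-bit quantisation) is an assumption, the class and field names are hypothetical, and the KV cache and runtime overhead are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    total_params_b: float   # all parameters, in billions
    active_params_b: float  # parameters used per token, in billions

    def weights_gb(self, bytes_per_param: float) -> float:
        """Approximate RAM needed just to hold the weights."""
        return self.total_params_b * bytes_per_param

    def read_per_token_gb(self, bytes_per_param: float) -> float:
        """Approximate weight data read from memory per generated token."""
        return self.active_params_b * bytes_per_param


DENSE_27B = Model("27B dense", total_params_b=27.0, active_params_b=27.0)
MOE_35B_A3B = Model("35B-A3B MoE", total_params_b=35.0, active_params_b=3.0)

BYTES_PER_PARAM = 0.5  # assumption: roughly 4-bit quantisation

for model in (DENSE_27B, MOE_35B_A3B):
    print(f"{model.name}: weights ~{model.weights_gb(BYTES_PER_PARAM):.1f} GB, "
          f"~{model.read_per_token_gb(BYTES_PER_PARAM):.1f} GB read per token")
```

Under these assumptions the MoE model needs more RAM than the dense one but reads far less data per token, which is consistent with it running much faster while drawing on fewer parameters for each token.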

by dofm 1 day ago

Good reply, those two links are v. useful and I had missed them.