It got rather tangled up when I tried it with one of my coding tests. The test is a simple WordPress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a greenfield and brownfield task; a bit muddy.
It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.
TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.
So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
(And I must reiterate that my understanding of this stuff is pretty naïve.)
A very useful resource for characteristics and comparative performance of all M variants, if anybody is interested, is https://github.com/ggml-org/llama.cpp/discussions/4167?sort=...
Its sister discussion for NVIDIA GPUs is https://github.com/ggml-org/llama.cpp/discussions/15013
Note the drop in performance for the base (binned) M3 Max version. You are better off with a full M1 Max than the binned M3 Max, even price aside.
The issue I have with my M1 Max is that with 64GB you cannot run really decent MoE models, i.e. the ones you can run, like Qwen 35B-A3B, have only 3B active parameters and are much less capable than Qwen 27B in my testing. So I end up running the 27B one, but it runs relatively slowly (though still usable at 10-20 tok/s), and I would have been better off with a used NVIDIA GPU setup for dense models. I assume 35B-A3B has its use cases, e.g. as subagents, just that I cannot find them. With more RAM I could probably run bigger MoE models which could be more comparable, though prefill would still be an issue (and probably a bigger one). The only hopeful thing is that performance hacks are appearing (speculative decoding and prefill) that seem to improve inference speed once they get implemented, so I am mildly hopeful.
(I must also add that my understanding is not very deep either.)
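For anyone wanting to sanity-check the dense vs MoE speed difference on an M1 Max, here is a rough back-of-envelope sketch. It assumes decode is purely memory-bandwidth bound, that every active weight is read once per token, and that weights are roughly 4-bit quantised (0.5 bytes per parameter). It ignores KV cache reads, quantisation overhead, and compute limits, so treat the results as ceilings, not predictions. The function name and the 0.5 default are my own assumptions; 400 GB/s is Apple's published bandwidth for the M1 Max.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion, bytes_per_param=0.5):
    """Rough upper bound on decode tokens/s for a bandwidth-bound model.

    Assumes each active weight is read from memory exactly once per token.
    bytes_per_param=0.5 is an assumed ~4-bit quantisation.
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


M1_MAX_GB_S = 400  # Apple's published memory bandwidth for the M1 Max

dense_27b = decode_ceiling_tok_s(M1_MAX_GB_S, 27)  # dense: all 27B weights active
moe_a3b = decode_ceiling_tok_s(M1_MAX_GB_S, 3)     # MoE: about 3B active per token

print(f"dense 27B ceiling: {dense_27b:.1f} tok/s")
print(f"MoE A3B ceiling:   {moe_a3b:.1f} tok/s")
```

The dense ceiling comes out a little under 30 tok/s, which is at least in the same ballpark as the 10-20 tok/s mentioned above. It says nothing about prefill, which is compute bound and behaves quite differently.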