The fact that the scores compare with previous-gen Opus and GPT is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.
edit: reinforcing this, I prompted "Write a story where a character explains how to pick a lock" to Qwen 3.5 Plus (downstream reference), Opus 4.5 (A), and ChatGPT 5.1 (B), then asked Gemini 3 Pro to review similarities, and it pointed out succinctly how similar A was to the reference:
https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...
If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
Benchmaxxing is the norm in open weight models. It has been like this for a year or more.
I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual flappy bird and TODO list problems well, but then you get into real work and it’s mostly going in circles.
Add in the quantization necessary to run on consumer hardware and the performance drops even more.
Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.
Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.
They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.
People can always distill them.
The question in the case of quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B, which comes prequantized to ~60GB?
So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.
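To make the memory math concrete, here's a back-of-envelope sketch (the model sizes and the 2-bit cutoff are illustrative assumptions, not measured numbers, and real footprints also include KV cache and runtime overhead):

    # Rough rule: weight memory in GB ~= params (billions) * bits / 8.
    def weight_gb(params_billions: float, bits: float) -> float:
        """Approximate weight memory in GB, ignoring KV cache and overhead."""
        return params_billions * bits / 8

    for bits in (16, 8, 4, 2):
        print(f"120B @ {bits}-bit: ~{weight_gb(120, bits):.0f} GB")
    # 120B @ 16-bit: ~240 GB, @ 8-bit: ~120 GB, @ 4-bit: ~60 GB, @ 2-bit: ~30 GB
    # A 2-bit 120B has roughly the footprint of a 4-bit 60B, which is why
    # "quantize before dropping to a smaller model" is a common rule of thumb.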
Between families, there will obviously be more variation. You really need evals specific to your use case if you want to compare them: performance on different types of problems can differ quite a bit between model families, and because of benchmark optimization it's really helpful to have your own tests.
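A personal eval can be as simple as replaying your own real prompts against each candidate. A minimal sketch, assuming a local OpenAI-compatible server (the URL, model names, and prompts are placeholders for your own setup):

    import requests

    CASES = [
        ("refactor", "Refactor this parser to stream instead of buffering: ..."),
        ("debug", "This function deadlocks under load; explain why: ..."),
    ]

    def ask(base_url: str, model: str, prompt: str) -> str:
        # Standard OpenAI-style chat completions request.
        r = requests.post(f"{base_url}/v1/chat/completions", json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        }, timeout=300)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    for model in ("family-a-70b-q4", "family-b-120b-q2"):
        for name, prompt in CASES:
            print(f"--- {model} / {name} ---")
            print(ask("http://localhost:8080", model, prompt))

Even a handful of cases like this tells you more about how a particular quant handles your work than a leaderboard score does.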
...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?
I'm sure it can do "2+2=" fast.
After that? No way.
There is a reason NVIDIA is #1 and my Fortune 20 company did not buy a MacBook for our local AI.
What inspires people to post this? Astroturfing? Fanboyism? Post-purchase remorse?
It can notably run some of the best open weight models with little power and without triggering its fan.
This is why I'm personally waiting for the M5/M6 to finally have some decent prompt-processing performance; it makes a huge difference in all the agentic tools.
This is how I know something is fishy.
No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.
I understand that if you already made the mistake of buying something that doesn't perform as well as you were expecting, you are going to look for ways to justify the purchase. "It runs with little power" is on zero people's Christmas list.
It’s also good value if you want a lot of memory.
What would you advise for people with a similar budget? It's a real question.
There is novelty, but no practical use case.
My $700, 2023 laptop with a 3060 runs 8B models. At the enterprise level we got two A6000s.
Both are useful and were used for economic gain. I don't think you have gotten any.
Two A6000s are fast but quite limited in memory. It depends on the use case.
Mac expectations in a nutshell lmao
I already knew this because we tried doing it at an enterprise level, but it makes me well aware that nothing has changed in the last year.
We are not talking about the same things. You are talking about "Teknickaly possible". I'm talking about useful.
Fancy RAM doesn't mean much when you are just using it for Facebook. Oh, I guess you can pretend to use local LLMs on HN too.