Yeah, I wouldn't get too excited. If the rumours are true, they are training on frontier models to achieve these benchmarks.
reply
They were all stealing from the internet and from writers; why is it a problem that they're stealing from each other?
reply
Nobody is saying it's a problem.
reply
This. Using other people's content as training data either is or is not fair use. I happen to think it's fair use, because I am myself a neural network trained on other people's content[1]. But that goes in both directions.

1: https://xkcd.com/2173/

reply
because Dario doesn't like it
reply
I think this is the case for almost all of these models - for a while Kimi K2.5 was responding that it was Claude/Opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.

The fact that the scores are comparable with previous-gen Opus and GPT is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.

edit: Reinforcing this, I gave the prompt "Write a story where a character explains how to pick a lock" to Qwen 3.5 Plus (downstream reference), Opus 4.5 (A), and ChatGPT 5.1 (B), then asked Gemini 3 Pro to review similarities; it pointed out succinctly how similar A was to the reference:

https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...

reply
They are making legit architectural and training advances in their releases. They don't have the huge data caches that the American labs built up before people started locking down their data, and they don't (yet) have the huge budgets the American labs have for post-training, so it's only natural to do data augmentation. Now that capital allocation is being accelerated for AI labs in China, I expect Chinese models to start leapfrogging to #2 overall regularly. #1 will likely remain OpenAI or Anthropic (for the next 2-3 years at least), but well-timed releases from Z.AI or Moonshot have a very good chance to hold second place for a month or two.
reply
Why does it matter, if it can maintain parity with frontier models that are just six months old?
reply
But it doesn't, except on certain benchmarks that likely involve overfitting. Open-source models are nowhere to be seen on ARC-AGI: nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328
reply
Have you ever used an open model for more than a bit? I am not saying they are not benchmaxxing, but they really do work well and are only getting better.
reply
I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.
reply
This could be a good thing. ARC-AGI has become a target for American labs to train on. But there is no evidence that improvements in ARC performance translate to other skills. In fact, there is some evidence that it hurts performance: when OpenAI trained a version of o1 on ARC, it got worse at everything else.
reply
That's a link from July 2025, so it's definitely not about the current release.
reply
...which conveniently avoids testing on this benchmark. A fresh account just to post on this thread is also suspect.
reply
Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
reply
GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the kinds of qualities most people assess when using LLMs.
reply
It was terrible at a lot of things; it was beloved because when you say "I think I'm the reincarnation of Jesus Christ" it will tell you "You know what... I think I believe it! I genuinely think you're the kind of person that appears once every few millennia to reshape the world!"
reply
Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...
reply
If you mean that they're benchmaxxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.

If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.

Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.

reply
> If you mean that they're benchmaxxing these models, then that's disappointing

Benchmaxxing is the norm in open weight models. It has been like this for a year or more.

I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual Flappy Bird and TODO-list problems well, but then you get into real work and it’s mostly going in circles.

Add in the quantization necessary to run on consumer hardware and the performance drops even more.

reply
Anyone who has spent any appreciable amount of time playing any online game with players in China, or dealt with amazon review shenanigans, is well aware that China doesn't culturally view cheating-to-get-ahead the same way the west does.
reply
I’m still waiting for real world results that match Sonnet 4.5.

Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.

Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.

They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.

reply
I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.

People can always distill them.

reply
They'll keep releasing them until they overtake the market or the government loses interest. Alibaba probably has staying power, but not companies like DeepSeek's owner.
reply
Will the 2026 M5 MacBook come with 390+ GB of RAM?
reply
Quants will push it below 256GB without completely lobotomizing it.
reply
> without completely lobotomizing it

The question with quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B, which comes prequantized to ~60GB?

reply
In general, quantizing down to 6 bits gives no measurable loss in performance. Down to 4 bits gives a small measurable loss. It starts dropping faster at 3 bits, and at 1 bit it can fall below the performance of the next smaller model in the family (families tend to come in model sizes spaced at factors of 4 in parameter count).

So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.
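
A quick back-of-envelope sketch of that crossover (pure weight-memory arithmetic in Python; KV cache and runtime overhead ignored, and the 70B/17B sizes are just illustrative):

    # Approximate weight memory: params * bits / 8 bytes.
    def weights_gb(params_billions, bits):
        return params_billions * bits / 8

    for bits in (16, 8, 6, 4, 3, 2):
        print(f"70B @ {bits}-bit ~ {weights_gb(70, bits):5.1f} GB")
    print(f"17B @ 8-bit  ~ {weights_gb(17, 8):5.1f} GB")

    # 70B @ 2-bit ~ 17.5 GB, about the same footprint as the next model
    # down (~70/4 = 17B) at 8-bit, which is roughly where the quality
    # crossover described above tends to land.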

Between families, there will obviously be more variation. You really need evals specific to your use case if you want to compare them: performance on different types of problems can differ quite a bit between model families, and because of benchmark optimization it's really helpful to have your own tests to really try a model out.
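
And for "evals specific to your use case", even something crude beats eyeballing benchmark tables. A minimal sketch, assuming an OpenAI-compatible local server (llama.cpp, vLLM, etc.) on localhost:8000; the prompts and pass checks are placeholders for your own tasks:

    import requests

    # (prompt, cheap pass/fail check) pairs -- replace with real tasks.
    PROMPTS = [
        ("Write a SQL query that lists duplicate emails in `users`.",
         lambda out: "group by" in out.lower()),
        ("Summarize: 'v2.0 removes the legacy /v1 endpoints.'",
         lambda out: "v1" in out.lower()),
    ]

    def ask(prompt, url="http://localhost:8000/v1/chat/completions"):
        r = requests.post(url, json={
            "model": "local-model",
            "messages": [{"role": "user", "content": prompt}],
        })
        return r.json()["choices"][0]["message"]["content"]

    score = sum(check(ask(p)) for p, check in PROMPTS)
    print(f"{score}/{len(PROMPTS)} passed")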

reply
> In general, quantizing down to 6 bits gives no measurable loss in performance.

...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?

reply
Did you run, say, SWE-bench Verified? Where does this claim come from? It's just an urban legend.
reply
Most certainly not, but the Unsloth MLX quant fits in 256GB.
reply
Curious what the prefill and token-generation speeds are. Apple hardware already seems embarrassingly slow at the prefill step, and OK at token generation, but that's with way smaller models (1/4 the size), so at this size? It might fit, but I'm guessing it might be all but unusable, sadly.
reply
They're claiming 20+ tok/s inference on a MacBook with the Unsloth quant.
reply
Yeah, I'm guessing Mac users still aren't very fond of sharing how long the prefill takes. They usually only share the output tok/s, never the input.
reply
My hope is the Chinese will also soon release their own GPU for a reasonable price.
reply
'fast'

I'm sure it can do "2+2=" fast.

After that? No way.

There is a reason NVIDIA is #1 and my Fortune 20 company did not buy a MacBook for our local AI.

What inspires people to post this? Astroturfing? Fanboyism? Post-purchase remorse?

reply
I have a Mac Studio M3 Ultra on my desk, and a user account on an HPC cluster full of NVIDIA GH200s. I use both, and the Mac has its purpose.

Notably, it can run some of the best open-weight models with little power and without triggering its fan.

reply
They can run, and token generation is fast enough, but prompt processing is so slow that it makes them next to useless. That is the case with my M3 Pro at least, compared to the RTX I have in my Windows machine.

This is why I'm personally waiting for the M5/M6 to finally get decent prompt-processing performance; it makes a huge difference in all the agentic tools.
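
For intuition on why decode feels fine while prefill hurts: decode is memory-bandwidth bound, prefill is compute bound. A rough sketch where every number is an assumption (a hypothetical ~30B-active-param MoE at 4-bit, M3 Ultra-ish specs), not a measurement:

    ACTIVE_PARAMS = 30e9      # assumed active params per token (MoE)
    BYTES_PER_PARAM = 0.5     # 4-bit quant
    MEM_BW = 800e9            # ~M3 Ultra memory bandwidth, bytes/s
    COMPUTE = 28e12           # rough GPU FLOP/s guess
    PROMPT_TOKENS = 30_000    # a typical agentic-coding context

    # Decode: each token streams the active weights from RAM once.
    decode_tps = MEM_BW / (ACTIVE_PARAMS * BYTES_PER_PARAM)   # ~53 tok/s

    # Prefill: ~2 FLOPs per active param per prompt token.
    prefill_s = 2 * ACTIVE_PARAMS * PROMPT_TOKENS / COMPUTE   # ~64 s

    print(f"decode ceiling ~ {decode_tps:.0f} tok/s")
    print(f"prefill of {PROMPT_TOKENS} tokens ~ {prefill_s:.0f} s")

A minute of prefill before the first output token is exactly the kind of thing raw tok/s numbers hide.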

reply
Just add a DGX Spark for prefill and stream to the M3 using Exo. The M5 Ultra should have about the same FP4 compute as a DGX Spark, but this way you don't have to wait until Apple releases it. Also, a 128GB "appliance" like that is now "super cheap" given RAM prices, and that won't last long.
reply
>with little power and without triggering its fan.

This is how I know something is fishy.

No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.

I understand: if you already made the mistake of buying something that doesn't perform as well as you expected, you are going to look for ways to justify the purchase. "It runs with little power" is on zero people's Christmas list.

reply
It was for my team. Running useful LLMs on battery power is neat, for example. Some people simply care a bit about sustainability.

It’s also good value if you want a lot of memory.

What would you advise for people with a similar budget? It's a real question.

reply
But you aren't really running LLMs. You just say you are.

There is novelty, but no practical use case.

My $700 2023 laptop with a 3060 runs 8B models. At the enterprise level, we got two A6000s.

Both are useful and were used for economic gain. I don't think you have gotten any gain.

reply
Yes, a good phone can run a quantised 8B too.

Two A6000s are fast but quite limited in memory. It depends on the use case.

reply
>Yes a good phone can run a quantised 8B too.

Mac expectations in a nutshell lmao

I already knew this because we tried doing it at an enterprise level, but it makes me well aware nothing has changed in the last year.

We are not talking about the same things. You are talking about "Teknickaly possible". I'm talking about useful.

reply
If you are happy with 96GB of memory, nice for you.
reply
I use my local AI, so: yes, very much.

Fancy RAM doesn't mean much when you are just using it for Facebook. Oh, I guess you can pretend to use local LLMs on HN too.

reply