I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particular use this as marketing material. You can correlate the benchmark releases with stock price increases. They feed synthetic datasets of ARC into their models to boost the numbers. There is no doubt in my mind Gemini is no better than DeepSeek other than being specifically fine tuned for ARC AGI. Heck, they even say so and they say they have paid annotations for ARC. Again, economic incentives. As for whether these models are actually better at the benchmarks, likely not. See ARC 3, where the gap is vanishingly small.

by versteegen 2 hours ago

I've also worked extensively on ARC AGI 1/2, and I mainly agree. Marketing and training. Performance of LLMs on ARC is most importantly a function of training on grid/table-like data. It doesn't have to be specifically synthetic ARC data though. Training an LLM to be better at perceiving grid-like arrangements of data in a spatial way like an image, rather than just tabular, is hugely useful for things outside of ARC benchmarks, though it's a narrow skill. Hence, I'm sure they do it. I want them to do that. I believe the labs when they say they didn't train specifically for ARC-AGI 1/2 (where did Google say otherwise? I don't see it). But it does not mean the models are getting better at general purpose reasoning. They were already plenty good enough at that. You can describe ARC images in words and reason about them using a level of intelligence LLMs have had for years: they're designed to be easy! LLMs just couldn't reason about image-like grids very well.
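
To make the "describe ARC images in words" point concrete, here is a minimal sketch of turning a small grid into plain sentences. The grid, the color names, and the `describe_grid` helper are all made up for illustration; this is not the real ARC task format or any lab's preprocessing.

```python
# Hypothetical illustration: convert a tiny ARC-style grid to a text description.
# The color mapping and the grid below are invented for this example.
COLOR_NAMES = {0: "black", 1: "blue", 2: "red"}


def describe_grid(grid):
    """Return one sentence for the grid size, then one per non-background cell."""
    lines = [f"The grid has {len(grid)} rows and {len(grid[0])} columns."]
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != 0:  # treat 0 as the background color
                lines.append(f"Cell at row {r}, column {c} is {COLOR_NAMES[value]}.")
    return "\n".join(lines)


grid = [
    [0, 1, 0],
    [0, 0, 2],
]
print(describe_grid(grid))
```

A description like this carries the same information as the grid, which is the sense in which the difficulty is in perceiving the layout rather than in the reasoning itself.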

by gpt5 4 hours ago

ARC-AGI isn't perfect, but it helps demonstrate the gap. I'm sure all companies optimize their models for this benchmark given its dominance.

by energy123 4 hours ago

Why do you think DeepSeek isn't also fine tuned on ARC AGI? Maybe they're more fine tuned on ARC AGI but still get worse scores. There's no way to know.

by usernametaken29 3 hours ago

My gut feeling is that ARC doesn’t play as big of a role in the Chinese model manufacturer landscape. It’s one byproduct but China is focusing on resource efficiency (for political reasons and low compute). So, unlike for OpenAI, poor performance on ARC doesn’t hurt as much if the model works well. OpenAI literally hinges on hype so the insane economic bets they make somehow pay off. If you have billions and the future of the company on the line, you ace the exam any way you can. We noticed early on that whenever some ARC dataset was released, GPT would suddenly do well on the classes of problems in that dataset. But it just doesn’t generalise. They fine tune like crazy. I bet they fine tune for raspberry counting at this point. Again, for OpenAI the perception of moat is everything! Keep that in mind.

by zozbot234 3 hours ago

True, ARC is mostly an artificial "human-like AGI" benchmark that doesn't really reflect any plausible workload. Very different from things like Humanity's Last Exam that reflect real-world knowledge and are now getting closer and closer to saturation even with open models.

by applfanboysbgon 5 hours ago

> Deep seek 3.2 is 4% on Arc-AGI 2

Why are you bringing up an outdated Chinese model from 6 months ago to compare to a US model from 6 months ago? The outdated Chinese model will have performance from ~12 months ago, obviously. But today's Chinese model DeepSeek 4 has performance not far from the US model 6 months ago; 46% compared to 52% from 5.2.

by gpt5 4 hours ago

Because Deepseek 4.0 is not yet there, but the jump isn't expected to be large. Kimi 2.5 is there and is also scoring low.

by DCKing 3 hours ago

Deepseek V4 came out three weeks ago: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

Kimi K2.5 has also been superseded by a finer tuned Kimi K2.6 three weeks ago. Moonshot's Kimi models appear to be the favored Chinese model, at least for coding, and not Deepseek V4. z.AI's GLM 5.1 is also worth mentioning as rather competent for coding, also released in April.

Those models too will not be beating US AI labs by your metrics (although for coding, Kimi K2.6 might beat the very uneven Gemini depending on the situation), but in your criticism at least consider the state of the art in your comparisons.

by KronisLV 7 minutes ago

Also they have a pretty big token discount running this month: https://api-docs.deepseek.com/quick_start/pricing/

Even without the discount, I'll have to think about whether I need the 100 EUR tier of Anthropic Max, or whether downgrading to Pro and using DeepSeek is good enough. And they're also up on OpenRouter and other places.

Been using those models, not quite comparable with Opus 4.6/4.7 but with max reasoning, pretty good for a variety of dev tasks! Only big problem is no ability to process images, so can't really do browser use for some semi-automated testing, I'd have to write Playwright tests even when I don't want to.

by noisy_boy 12 minutes ago

I have been using Deepseek v4 pro for personal projects and home infra related work for the last couple of weeks. Its quality of work is not bad at all, it is fairly fast and, given the fraction of the cost compared to Claude, I can keep going, which makes it a very compelling option. Looking forward to trying out Kimi 2.6, thanks for the recommendation.

by pjerem 4 hours ago

Hum, I've been using it [0] with my Ollama Cloud subscription for the last two weeks and I love it. Never reached the 5-hour usage limit of the $20 plan (on side projects), where I would sometimes reach it in ONE prompt with Opus.

[0]: https://ollama.com/library/deepseek-v4-pro

by sho 6 hours ago

I 100% agree with you, but I've been convinced over the last year that it's a time and scale issue, not anything fundamental.

The Chinese models right now are in a weird spot. Compared to the frontiers, both their pre- and post-training are woeful - tiny, resource constrained in every dimension including human, slow. I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!

But they "cheat" quite a lot in distillation and very benchmark-focussed RL and that's where you get this superficial quality in the leaderboards that doesn't match up when you go off-script. Arc is a great example in that it really betrays an "inferior soul" at the heart of it all.

What gives me great hope though is that those same scaling laws that Altman and others have been hyping forever will absolutely kick in for the Chinese labs just as they did for the US ones, and I don't think anything can stop that process now. So they will catch up. It won't be tomorrow, but it's not going to be 10 years either. 3-5 would be my reasonably educated guess.

And the final risk, that China itself might try to restrict availability of the tsunami of GPU or other AI hardware it will inevitably produce - well, I just can't really imagine a country that has been configuring itself for the last 40 years as a single purpose export machine deciding that actually, no, it doesn't want to export something.

About the model restrictions - absolutely. I've been trying to do security research on my own software and the frontier models immediately get suspicious. I've been playing with the local ones much more this year basically because of this. They have deficiencies, for sure - they feel very "hollow" compared to the major labs. But I've talked to a lot of people, and the consensus is pretty clear - just a matter of time.

by flir 4 hours ago

Just an observation: constraints often result in creative solutions. I wouldn't be surprised if a smaller lab makes a big breakthrough because they have to.

by ageitgey 5 hours ago

Have you tried the latest DeepSeek v4 Pro inside of the Claude Code harness? It's not listed on that site.

It definitely 'feels like' it is as good as Claude for many regular web app coding tasks (though I don't have real benchmarks). And it is comically cheap.

I'm not suggesting it is better than the latest Claude or codex models, but it seems 'good enough' for a lot of use cases in my limited real world testing.

by PAndreew 4 hours ago

I'm starting to feel like a parrot, but people seem to forget that software engineering is actually a very narrow slice of the white collar pie. You don't need a mega-model which can reason about 100 000 lines of code when you want to create a nice PPT (which consumed literally hours of your life before) to impress your boss. SOTA models will probably be used for frontier research, complex coding tasks, large scale data analysis, etc. And the average Joe shall be able to buy a pre-configured box with a plug-and-play harness and run medium models air-gapped. Or use such models through cloud APIs dirt cheap if privacy is not a concern.

by ageitgey 3 hours ago

On the same topic but from a slightly different angle - as SOTA models get more capable, the 'quality' and 'feel' of the experience they provide in each domain is heavily dependent on the reinforcement learning the vendor does for that specific domain. After all, many fields have 100 flavors of "good answers," but the model has to pick one answer.

Benchmarks are not very good at capturing this yet. But it could be the case that DeepSeek v4 Pro is 100% as good as Claude Opus 4.7 at scaffolding a basic Rails app, but absolutely terrible at creating a credible business plan that another businessperson would think is real. That's a made-up example, but you get the point.

The end result will be a lot of people arguing about which model is "better," but "better" depends heavily on the task and how that model was trained to interact with the user for that task. Two users may have very different qualitative experiences using the exact same model, despite the benchmarks.

by zozbot234 4 hours ago

Creating a nice PPT is actually hard because it requires visual capabilities and so-called "computer use" (really, GUI use) of fiddly proprietary software. The nice thing about the coding case compared to a lot of disparate white-collar work is that it's all plain ASCII text. You can already ask a coding model to create a nice TeX/beamer slideshow (or whatever the Typst-based equivalent is) but whether your boss will be duly impressed by that is anyone's guess.

by m_mueller 2 hours ago

Tangential, but in our opinion corporate PPTX automation is an unsolved problem, even with Claude for PowerPoint (and it's worse with everything else common out there). Its harness (a) is not tuned very well for corporate use and (b) even if it were, fails to manage the specific business knowledge within each org needed to create effective (i.e. audience tailored) presentations.

I've just written a blog post about this topic this week: https://octigen.com/blog/posts/2026-05-11-ai-presentation-ga...

by nimonian 3 hours ago

This is a tangent but I'd also mention sli.dev -- slideshow-as-website is really great and fun to make with LLMs.

by omnimus 5 hours ago

Also, so many developers I know use LLMs for one-shotting isolated problems, explainers, discussions and planning. For these even Kimi is pretty great.

I don't think every dev will be comfortable just releasing Claude on their project.

by energy123 4 hours ago

They're not even that much cheaper (1/2 price per task according to Artificial Analysis) once you account for lower token usage of GPT-5.5. I can't justify it when factoring in the extra time wasted, and the cheap codex usage I get through the monthly plan. Frontier intelligence is not a commodity product ... yet.
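
A rough sketch of the arithmetic behind "price per task" versus "price per token". Every number below is a made-up placeholder, not a figure from Artificial Analysis or any vendor's price list; the point is only that a model that is 4x cheaper per token but uses 2x the tokens ends up at half the price per task.

```python
def cost_per_task(price_per_million_tokens, tokens_per_task):
    """Cost of one task in the same currency as the per-million-token price."""
    return price_per_million_tokens * tokens_per_task / 1_000_000


# Placeholder numbers, chosen only to illustrate the effect of token usage.
cheap_model = cost_per_task(price_per_million_tokens=1.0, tokens_per_task=400_000)
frontier_model = cost_per_task(price_per_million_tokens=4.0, tokens_per_task=200_000)

ratio = cheap_model / frontier_model
print(f"cheap: {cheap_model:.2f}, frontier: {frontier_model:.2f}, ratio: {ratio:.2f}")
```

With these placeholders, the per-token gap of 4x shrinks to 2x per task, before counting any extra time spent on retries.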

by irthomasthomas 2 hours ago

Arc has no predictive power whatsoever. I always use the best models available. So far I haven't found a task that Chinese models cannot solve very quickly and reasonably. Do you have any examples where they failed for you?

by otabdeveloper4 5 hours ago

And yet Claude six months ago was amazing and good enough for you.

This shows that AI cloud consumption is just a conspicuous consumption status symbol, nobody knows why they need cloud AI or what problem they are even solving.