You don't know what's happening at z.ai or Alibaba. And you don't know what's happening at Anthropic or OpenAI.
I don't know what they are all doing, but I find it extremely unlikely that they are not all collecting data from one another. I am confident Anthropic has a team going over the GLM 5.2 weights, even if it's just to see where the competition is.
Just because some labs are getting data from Anthropic does not mean they are not also doing their own research.
They were focused on optimization because they could not get the best hardware. The only reason their top labs are behind may be that they did not have H200s and MI350s. And now they do.
Plus you are discounting other risks: Anthropic is currently sitting on "the best" models in the world because they got in a pissing match with the US administration.
btw: This could be the case in China as well. Their administration has been surprisingly open on AI exports and open-weight models, as far as we know. There is a very small but not trivial chance they are hogging a better version of GLM 5.2, for example, and no one is allowed to talk about it. Now, I am not saying that is the case; I am saying the two cases (Chinese labs are 6 months behind, or they are forced to suppress their best models) are indistinguishable from the outside.
Even if your characterization is accurate, they could do this tomorrow and are not so myopic that they wouldn’t have thought about it. I don’t see this as a barrier, and I see a lot of the same underestimation of Asia that’s been happening for 50 years. There’s not some innate American advantage to building LLMs, and personally I think whatever head start the US has is going to be squandered on delays from the export control “too dangerous for release” LARPing we’re seeing.
Also, I was responding to a claim about what will happen in less than 6 months (that’s about the limit of what you can meaningfully predict in this field).
These strategies take materially different resources; it’s not an overnight decision made by leadership. I suppose there is a natural experiment ongoing at Meta regarding this: it seems they recently moved a number of people into a division to produce such data overnight. So we will find out soon how quickly they climb the leaderboards.
Distilling from a better model, even with small amounts of data, is still helpful. It doesn't transfer capabilities that the raw internet-trained model lacks entirely; it helps identify the capabilities that are compatible with the servile assistant persona and suppress the ones that are undesirable (e.g. trolling). A primitive version of this was the instruction-tuning datasets generated with OpenAI models, as used e.g. for Alpaca (which was built on text-davinci-003 outputs).
Without a clear target to emulate, competitors might have to rely more on human raters, but there are plenty of data labeling companies in China, so that's hardly a hurdle.
Distillation and copying are how they’ve bootstrapped their models, but that feels not so different from Anthropic and Meta torrenting millions of pirated books.
The Chinese labs are solving problems for a different set of constraints.
The use of US models for Chinese model training is part of the motivation of all of this.
But if they can stay on pace, within say 6 to 12 months of the bleeding edge of the American frontier models, that’s a huge problem.
If they can just piggyback on the Herculean efforts of Anthropic, OpenAI, Google etc., accept a little bit of lag, and save billions of dollars? Why wouldn’t they?
And for the end user, why would they pay a premium subscription price for something they can just wait six months for and run on their own hardware at home? In my opinion, this is the cat-and-mouse game that’s being played right now. And I suspect it’s intentional on the side of the open-weight models. I would bet they are playing a war of attrition.
They don't even need to 'win' in the sense of maxing the benchmark. They can be 20% worse/50% cheaper and many of us (and our managers who approve our token budgets) will be in.
DeepSeek is 30x cheaper for input and 75x cheaper for output than Sonnet on OpenRouter, and it's not a whole lot worse for many things.
Kneecapping their pricing power is enough to trigger a valuation reset of an order of magnitude and humble them a bit.
Plus there are always infrastructure and hardware providers who want to keep their share of the profits and will squeeze Anthropic's margins to deflate its valuation (Nvidia, AWS, RAM manufacturers, etc.).
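Back-of-envelope on the price gap, in case it helps: the effective discount depends on how your bill splits between input and output tokens. Rough sketch below. The 30x and 75x are just the ratios quoted above (I haven't re-checked them against a current price sheet), and `blended_cost_ratio` and `output_share` are names I made up for illustration.

```python
def blended_cost_ratio(input_ratio: float, output_ratio: float,
                       output_share: float) -> float:
    """Return how many times cheaper the cheap model is overall.

    input_ratio / output_ratio: how many times cheaper the cheap model is
        per input / output token (30 and 75 in the comment above).
    output_share: assumed fraction of the *expensive* model's bill that
        goes to output tokens, between 0 and 1.
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    # Normalise the expensive bill to 1.0, then scale each part down
    # by its own per-token ratio to get the cheap model's bill.
    cheap_bill = (1.0 - output_share) / input_ratio + output_share / output_ratio
    return 1.0 / cheap_bill


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.5, 0.75, 1.0):
        ratio = blended_cost_ratio(30, 75, share)
        print(f"output share {share:.2f}: about {ratio:.1f}x cheaper overall")
```

Whatever the mix, the overall ratio stays between 30x and 75x, so even a heavily input-bound workload sees the full 30x. That is a much bigger gap than the "50% cheaper" threshold mentioned above.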
1. It's unclear if there is a law of diminishing returns with ever-larger models. They're more expensive to run and for many applications, you'll probably find smaller models are sufficient;
2. There's an inbuilt market for local LLMs. This is an effective limit on how large models can get. Case law hasn't been established yet on, for example, if a law firm using ChatGPT breaks privilege. Specifically, chat logs may be discoverable. Medical applications have this issue too and I think you'll find that financial firms are going to be leery about this as well;
3. Better, larger models will bleed into smaller, open source models. The chat logs themselves are training data. There's a whole market in China for Claude tokens around this;
4. China has a national security interest in not being beholden to US tech giants when it comes to AI. China has a history of being able to commit to large-scale long-term projects and Anthropic just won't be able to compete with a national project by one of the world's superpowers, if it comes down to it;
5. Winning doesn't necessarily mean being the best. Often it's just being good enough;
6. As an example of a national project, China is busy replicating EUV because of the US ban on ASML and Nvidia exporting their best stuff. I don't think many in the West are prepared for how rapid this will be. I'm reminded of the policy debate in 1945, when many in American policy and military circles thought the USSR would never catch up on the atomic bomb or, if they did, it would take 20+ years. It took 4 years. For the hydrogen bomb, it took 1. The US hardware advantage is a lot more tenuous than many realize.
Kind of an oxymoron, don’t you think?
If they could generate data that looked kind of real, why don’t they just generate that data on the fly during inference?