Yes, challenger labs publish out of necessity. It is a marketing strategy. People assume open source means giving something up, but the reality is that Z.ai has a revenue of some $100M, and it would be about $0M if they had never open sourced their models.
reply
Wouldn’t that just help the American labs anyway though? Or do they assume they’ve actually already figured this stuff out and kept it secret?
reply
It used to be the case that the NSA hired the majority of all math graduates in the US and was assumed to be years ahead in cryptography. Yet in the 90s it became clear that this was no longer true: among other things, the key-escrow mechanism of the notorious Clipper chip was broken, and we can rule out that it was made weak on purpose, because the whole point of Clipper was that they had a backdoor.

So, despite hiring the cream of the crop of math graduates, who could read the papers of free academia but whose own results the free world could not access, they fell behind.

I have a theory explaining why. I think it's because science is an interactive process. NSA cryptographers could read papers, but they couldn't talk openly with the authors of those papers because of secrecy demands - even asking a question might indicate what they were working on. You can easily imagine them spending months on something they could have avoided by going to the original authors and being told "Oh, we tried that for a long time, it doesn't work".

Whether that theory is right or not, cryptography is a concrete example of a domain where public research with fewer resources beat private research with a lot more resources.

reply
Everyone in this thread is getting distracted by nationalism, but you hit the nail on the head. In this case, for whatever reason, the Chinese AI industry is collaborative and the American AI industry is not. This will result in the Chinese companies making progress faster. Full stop. This isn't a judgement on the merits of either system, only an observation of likely results.
reply
Hasn't that been the mantra of open source for 40 years? Armies of companies, trillions in valuation, or even just Wayland suggest that isn't always the case.
reply
So free software can only be considered a successful strategy if every single project succeeds?
reply
Reminds me of .NET from the early 2000s to 2012... No one collaborated.
reply
From what I gather, the Chinese are behind, but a lot of their research amounts to scrappy, clever discoveries in how to use more novel technologies (for Qwen and DeepSeek, it's mixture-of-experts models, which can do inference using only a portion of the model at a time). The Chinese labs also distill information from American models, so there’s that.

The American companies, from my impression, don’t involve themselves with such lowly “hacks” because they have so much money that they can just push forward with doing everything on big, heavy models that run on the most cutting-edge Nvidia chips they can, at the moment, kinda sorta get on demand (I say that in some degree of jest).

reply
The American companies would love to develop these 'hacks' because it would make them more money, something they are in existential need of right now.

They don't develop them because they don't collaborate publicly anymore.

Where would the whole industry be if Google never allowed publishing the transformers paper?

It's not a coincidence that the American AI industry grew fastest in capability when it was the most open.

reply
Just a crazy catch-22, it seems.
reply
Why would they collaborate? Why not defect and just keep theirs private and implement the open ones?
reply
This is not an effective long-term strategy in a collaborative environment that is advancing, for the same reason that keeping a private, secret fork of the Linux kernel with a few proprietary improvements is not an effective strategy.

Integrating your own work with the latest public advances takes resources. For one or two small changes this is manageable, but the further you diverge from the public frontier, the more the cost of maintenance rises (exponentially, if you want to keep integrating public advances). When you publish your meaningful advance, you offload the maintenance burden onto everyone else, and they only have to pay a linear cost rather than an exponential one, because your advance is integrated by default in new work.

In most cases, the (exponential) maintenance cost of integrating public advances with secret ones exceeds the value of the public advances. So most who undertake this strategy of advancing the open frontier in secret don't attempt to integrate continually, but instead try to make a breakaway sprint in isolation to grab a few sticky customers before the unstoppable wave of the public frontier catches up.
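A toy model of that cost argument, as a rough sketch only. The per-change cost `c` and the growth factor `g` are made-up numbers, not measurements, and "exponential" is taken literally as `c * g**k` for `k` unpublished changes:

```python
def private_fork_cost(k, c=1.0, g=1.5):
    """Assumed cost of one re-integration with the public frontier
    while carrying k secret changes: c * g**k (hypothetical model)."""
    return c * g ** k


def published_cost(k, c=1.0):
    """Assumed cost for everyone else to build on k published changes:
    linear in k, c * k (hypothetical model)."""
    return c * k


# Print both costs side by side for a growing number of divergent changes.
for k in (1, 2, 5, 10, 20):
    print(k, round(private_fork_cost(k), 1), published_cost(k))
```

With these assumed numbers the two costs are close for one or two changes and then separate quickly, which is the shape of the argument; the real constants are unknown.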

This is a pattern commonly seen in university research departments when researchers switch into product development mode. Most of these projects are a sprint away from the public frontier once a good idea is found, and they do good work and find a few customers for a little while. But if you check back in a few years, you won't find an advanced research department but a zombie IP company that brings in a steady income via IP enforcement and a small number of customers for whom switching is too expensive.

reply
I'm afraid I'm even balking at the word "pioneering" in connection with US frontier labs. They are probably doing a few new things, right, but they are not blazing any trails for others to follow. The Chinese are.
reply
> Publishing by necessity

It's more a cultural thing. Sharing progress is just in their blood.

reply
This is overly simplistic to the point of glazing. Plenty of Chinese companies maintain industrial secrets to gain an advantage.
reply
Chinese papers and techniques have been very influential and copied by US labs.

Multi-head Latent Attention (MLA), Multi-Token Prediction (MTP), and the MoE architecture are some of the most famous examples.

reply
MoE is from Google (Noam Shazeer)

MTP is from Meta

Another DeepSeek advance that the West is copying is DeepSeek Sparse Attention (DSA).

reply
Mixture-of-Experts (MoE) was introduced in the 1990s [1, 2]; see also [3, 4]. The idea was that MoE scales up model capacity while introducing only a small computational overhead. MoEs did not become viable for high-performance applications until sparse routing was integrated with modern deep networks, made possible by large-scale distributed computation. The breakthrough came with the development of sparsely gated networks [5], which showed that it is possible to maintain model accuracy while activating only a small fraction of a large parameter network during both training and inference.

[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures of local experts. (1991)

[2] M. I. Jordan, R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm. (1993)

[3] L. Xu, M. Jordan, G. E. Hinton, An alternative model for mixtures of experts. (1994)

[4] S. Waterhouse, D. MacKay, A. Robinson, Bayesian methods for mixtures of experts. (1995)

[5] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. (2017)
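For anyone who wants to see the "only a small fraction is activated" part concretely, here is a minimal pure-Python sketch of top-k sparse routing. It is an illustration under simplifying assumptions, not the method of [5]: the class name `SparseMoE`, the random linear "experts", and the `calls` counter are all hypothetical, and the noise term and load-balancing loss used in [5] are left out.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


class SparseMoE:
    """Toy sparsely gated layer: each input is routed to k of n experts."""

    def __init__(self, dim, num_experts, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Gate: one weight vector per expert, producing one logit each.
        self.gate = [[rng.gauss(0, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        # Each expert is a random dim x dim linear map (stand-in for an FFN).
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]
        # Counts how often each expert was actually evaluated.
        self.calls = [0] * num_experts

    def forward(self, x):
        logits = [dot(g, x) for g in self.gate]
        # Keep only the k largest gate logits; other experts are never run.
        top = sorted(range(len(logits)), key=lambda i: logits[i],
                     reverse=True)[:self.k]
        weights = softmax([logits[i] for i in top])
        out = [0.0] * len(x)
        for w, i in zip(weights, top):
            self.calls[i] += 1
            y = [dot(row, x) for row in self.experts[i]]
            out = [o + w * yi for o, yi in zip(out, y)]
        return out


moe = SparseMoE(dim=4, num_experts=8, k=2)
output = moe.forward([1.0, 0.5, -0.3, 2.0])
print(output, moe.calls)
```

All 8 experts count toward the parameter total, but only 2 are evaluated per input, which is the capacity-versus-compute trade the comment describes.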

reply
Yes - I meant as applied to LLMs/Transformers.
reply