And even that would be a rich accusation coming from SOTA models that depend on explicitly disregarding the intellectual property in millions of training examples.
LLMs are themselves copycats.
I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)
On the architecture side, I'm a lot more interested in attention residuals than anything else. It's one of those things that seems obvious in hindsight, and Kimi have proven it at scale.
Yes, variants are typically 2-3x worse...
Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.
MTP will still be highly valuable for interactive use of course.