I haven't read the article at the moment and I will try to read them hopefully but I wish to ask a question regarding, can this approach be done for say trillion or large parameter models as well or is there some wall which gets hit that makes it valuable for only smaller parameter model.
That being said, its still really incredible because in future, because these small models are really getting good for many use cases and speed becomes their bottleneck, with greater speeds at consumer hardware, I think its gonna be amazing work!
The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.
Also worth noting that our results are currently for standard datacenter GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.