Generally, you use an ASIC to perform a specific task. In this case, I think the takeaway is that the LLM functionality here is performance-sensitive and has enough utility as-is to justify an ASIC.
The idea of AI as static weights is already challenged by the frequent model updates we see - and it may even become a relic once we find a new architecture.
And then it'll increasingly make sense to build such a chip into laptops, smartphones, wearables. Not for high-end tasks, but to drive the everyday bread-and-butter tasks.
FPGAs don’t scale; if they did, GPUs would have been replaced by FPGAs for graphics a long time ago.
You use an FPGA when spinning a custom ASIC doesn’t make financial sense and a generic processor such as a CPU or GPU is overkill.
Arguably the middle ground here is TPUs, which take the most efficient parts of a “GPU” for these workloads but still rely on memory access at every step of the computation.
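You can make the memory-access point concrete with a quick arithmetic-intensity estimate (a rough sketch, the numbers are illustrative, not benchmarks): single-token decode multiplies an activation vector by an N×N weight matrix, doing about 2N² FLOPs while streaming about 2N² bytes of fp16 weights, so roughly one FLOP per byte moved.

```python
# Rough arithmetic-intensity sketch for single-token decode
# (illustrative back-of-envelope math, not a benchmark).
def matvec_intensity(n, bytes_per_weight=2):  # 2 bytes per fp16 weight
    flops = 2 * n * n                    # one multiply + one add per weight
    bytes_moved = n * n * bytes_per_weight  # every weight streamed from memory
    return flops / bytes_moved

print(matvec_intensity(4096))  # 1.0 FLOP/byte: far below what the ALUs can
                               # sustain, so decode is memory-bandwidth-bound
```

That ~1 FLOP/byte is why keeping weights resident next to the compute (or baked into it) matters so much for this workload.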
I think burning the weights into the gates is kinda new.
("Weights to gates." "Weighted gates"? "Gated weights"?)
It’s also not that different from how TPUs work, where the PEs (processing elements) have special registers for weights.
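A toy sketch of that weight-stationary idea (pure Python, class and function names are mine): each PE holds its weight in a local register, loaded once, and only activations and partial sums move on each step. Fixing the weight in silicon just takes this one step further.

```python
# Toy weight-stationary PE: the weight is loaded once and stays resident,
# so per-step traffic is only the activation in and the partial sum through.
class PE:
    def __init__(self, weight):
        self.w = weight                 # preloaded / "burned in", never re-fetched

    def step(self, activation, partial_sum):
        return partial_sum + self.w * activation

# A chain of PEs computes one dot product as activations stream past.
def dot(weights, activations):
    pes = [PE(w) for w in weights]      # one-time weight load
    acc = 0.0
    for pe, a in zip(pes, activations):
        acc = pe.step(a, acc)
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```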
Also, offline access is still a necessity for many use cases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
To your point, it's neat tech, but the limitations are obvious, since 'printing' only one LLM ensures further concentration of power. In other words, history repeats itself.
I don't expect it's commercially viable today, but things definitely need to trend toward radically more efficient AI solutions.
If you baked one of these into a smart speaker that could call tools to control lights and play music, it would still be able to do that when Llama 4 or 5 or 6 comes out.
[1] although security might be a big enough reason for upgrades to still be required
In the real world, there are talking refrigerators that don't need to know how to recite Shakespeare.
I think the interesting point is the transition time. When is it ROI-positive to tape out a chip for your new model? There’s a bunch of fun infra to build to make this process cheaper/faster and I imagine MoE will bring some challenges.
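One way to frame the ROI question is a simple break-even: tape-out NRE divided by the per-unit savings over a commodity chip. Every number below is a made-up placeholder, just to show the shape of the calculation.

```python
# Back-of-envelope break-even for taping out a model-specific ASIC.
# All dollar figures are hypothetical placeholders.
def breakeven_units(nre_cost, gpu_unit_cost, asic_unit_cost):
    # Units you must ship before per-unit savings pay back the tape-out NRE.
    return nre_cost / (gpu_unit_cost - asic_unit_cost)

units = breakeven_units(nre_cost=25_000_000,   # mask set + design (hypothetical)
                        gpu_unit_cost=2_000,
                        asic_unit_cost=500)
print(int(units))  # 16666: if the model goes stale before you ship that many,
                   # the tape-out wasn't ROI-positive
```

The interesting part is that both sides of that fraction move: cheaper tape-out infra shrinks the numerator, while faster model churn shrinks the window you have to reach the break-even count.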
We transitioned from software on CPUs to fixed GPU hardware... But then we transitioned back to software running on GPUs! So there's no way you can say "of course this is the future".