You might like this article [1], titled "FPGA-based CNN Acceleration using Pattern-Aware Pruning". More context and details can be found in the PhD thesis of Léo Pradels [2].
[1]: https://inria.hal.science/hal-04689673/document
[2]: https://theses.hal.science/tel-05021575v1/file/PRADELS_Leo.p...
If you could build chips that run one specific LLM 100x faster than anything else, they would have a use case that nothing else could match.
And the high water mark of what can be solved by open models will keep going up.
If used in war, I would think that would need the ability to be updated frequently. For example, your enemy might find out (say, by running tests on hardware they captured from you) that painting a particular shape in red paint (a smiley might even work) on their hardware prevents your drones from attacking them, because the model confuses that pattern with the Red Cross logo.
[1] Yes, this is a massive simplification
In contrast, tomorrow’s models will typically run, although perhaps more slowly, on general-purpose inference hardware that was released today or even years ago.