The prototype is silicon with Llama 3.1 8B etched into it. Today's 4B models already outperform it.
A five-digit token rate is a major technical flex, but does anyone really need to run a very dumb model at this speed?
The only use cases that come to mind are asymmetric exotics: VLA action policies and the voice stages of V2V models. Both follow the "small fast low-latency model backed by a large smart model" pattern (sketched below), and both depend on model-to-model comms, which this doesn't demonstrate.
In a way, it's an I/O accelerator rather than an inference engine. At best.
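For concreteness, here's a minimal sketch of that small-model/large-model pattern, in the spirit of speculative decoding: the fast model drafts tokens, the slow model verifies them. Everything here is a hypothetical stand-in (the model functions are toys, not any real API), just to make the coupling between the two tiers visible.

```python
# Sketch of the "small fast model backed by a large smart model" pattern.
# Both model functions are hypothetical placeholders, not real endpoints.
import random

def small_model_draft(context: list[str], k: int) -> list[str]:
    # Stand-in for the fast, cheap drafter (e.g. the etched-silicon model):
    # proposes k candidate tokens in one quick pass.
    return [f"tok{random.randint(0, 9)}" for _ in range(k)]

def large_model_next(context: list[str]) -> str:
    # Stand-in for the slow, smart verifier: one authoritative token.
    # (Real speculative decoding verifies all k drafts in a single
    # batched forward pass; this per-token loop is just for clarity.)
    return f"tok{random.randint(0, 9)}"

def speculative_step(context: list[str], k: int = 4) -> list[str]:
    """Draft k tokens cheaply, keep the prefix the big model agrees
    with, and take one corrected token at the first disagreement."""
    draft = small_model_draft(context, k)
    accepted: list[str] = []
    for tok in draft:
        target = large_model_next(context + accepted)
        if tok == target:
            accepted.append(tok)      # agreement: the cheap token stands
        else:
            accepted.append(target)   # disagreement: take the fix, stop
            break
    return accepted

if __name__ == "__main__":
    ctx: list[str] = ["<bos>"]
    for _ in range(5):
        ctx += speculative_step(ctx)
    print(ctx)
```

Note what the pattern hinges on: the fast drafter only pays off if the round-trip to the verifier is cheap, which is exactly the model-to-model comms question the demo leaves open.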
That was always the killer assumption, and this changes little.
If you look at any development in computing, ASICs are the next step. It seems almost inevitable. Yes, they will always trail the state of the art. But the value will come quickly, within a few generations.