Some of the work in that direction like Cerebras or Taalas have been doing is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder about what might be possible if even current state of the art models were available at like, a million tokens per second at a very low cost.
Modem vs Claude according to Claude:
300 @ 2368 characters - 1m 19s
1200 @ 2368 characters - 19.7s
2400 @ 2368 characters - 9.9s
14.4K @ 2368 characters - 1.6s
33.6K @ 2368 characters - 705 ms
56K @ 2368 characters - 447 ms
Claude @ 2368 characters - 7.9s
Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.
Heck, I'm still a fan of Gemma 2 9B.
They seem to be doing well. I checked recently and their API is closed to signups due to overwhelming demand.
They built an entire wafer ASIC. The entire thing is one huge active ASIC. it takes a lot of cool engineering and cooling to make it work, and is very cool.