This is literally what Taalas has done with chatjimmy.ai.

Try it: it's Llama 3.1 8B at 16,000 tokens per second.

chatjimmy.ai https://taalas.com/the-path-to-ubiquitous-ai/
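
For a sense of scale, here is a rough back-of-the-envelope sketch of what that quoted throughput would mean per token and per reply. The 500-token reply length is an assumption for illustration, and the calculation ignores prompt processing and network latency, so real end-to-end times would be higher.

```python
# Back-of-the-envelope latency from the throughput figure quoted above.
TOKENS_PER_SECOND = 16_000  # figure quoted in the comment, not measured here


def seconds_for(tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Time to generate `tokens` tokens at a steady rate.

    Ignores prompt processing and network latency.
    """
    return tokens / tokens_per_second


per_token_us = seconds_for(1) * 1e6   # microseconds per token
reply_ms = seconds_for(500) * 1e3     # 500 tokens is an assumed reply length

print(f"{per_token_us:.1f} us per token, {reply_ms:.2f} ms for a 500-token reply")
```

At that rate, generation time for a typical reply would be well below what a person can perceive.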

Wow, that's incredibly fast. I like this outcome more than centralized datacenters.
But it can only run that model, so it will be outdated in a few years at best.
There are lots of things you can do in hardware that could be done in software, but at a cost. FPGAs should have solved this long ago, but apparently the guys who own the IP want to make them as hard as possible to use …