Current open weight models < 20B are already capable of being useful. With even 1K tokens/second, they would change what it means to interact with them or for models to interact with the computer.
dwata: Entirely Local Financial Data Extraction from Emails Using Ministral 3 3B with Ollama: https://youtu.be/LVT-jYlvM18
Talas promises a 10x higher throughtput, being 10x cheaper and using 10x less electricity.
Looks like a good value proposition.
In full precision, yes. But this talaas chip uses a heavily quantized version (the article calls it "3/6 bit quant", probably similar to Q4_K_M). You dont even need a GPU to run that with reasonable performance, a CPU is fine.
Roof! Roof!
Exciting times.