The hardwired model is Llama 3.1 8B, which is a lightweight model from two years ago. Unlike other models, it doesn't use "reasoning": the time between question and answer is spent directly predicting the next tokens of the answer. It doesn't run faster because it spends less time "thinking"; it runs faster because its weights are hardwired into the chip rather than loaded from memory. A larger model running on a larger hardwired chip would run about as fast and get far more accurate results. That's what this proof of concept shows.
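Rough back-of-envelope sketch of why that matters, assuming fp16 weights and H100-class memory bandwidth (my illustrative numbers, not from the article): on a conventional accelerator, generating each token has to stream the entire weight set from memory, so the token rate is capped by bandwidth; hardwiring the weights removes that traffic entirely.

```python
# Back-of-envelope: for a memory-bound decoder, each generated token
# requires streaming every weight from memory once, so tokens/sec is
# roughly bandwidth / model size. Numbers below are assumptions.

MODEL_PARAMS = 8e9          # Llama 3.1 8B
BYTES_PER_PARAM = 2         # assuming fp16/bf16 weights
HBM_BANDWIDTH = 3.35e12     # ~3.35 TB/s, an H100-class GPU (assumed)

model_bytes = MODEL_PARAMS * BYTES_PER_PARAM

# Conventional accelerator: weights reloaded from HBM for every token.
memory_bound_tps = HBM_BANDWIDTH / model_bytes
print(f"memory-bound ceiling: ~{memory_bound_tps:.0f} tokens/s per stream")

# With the weights etched into the chip there is no weight traffic at all,
# so per-token latency is set by compute depth, not memory bandwidth.
# That's why the hardwired chip can be orders of magnitude faster
# without the model being any "smarter".
```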
reply
I see, that's very cool, that's the context I was missing, thanks a lot for explaining.
reply
If it's incredibly fast at a 2022 state-of-the-art level of accuracy, then surely it's only a matter of time until it's incredibly fast at a 2026 level of accuracy.
reply
Yeah, this is mind-blowing speed. Imagine this with Opus 4.6 or GPT 5.2. Probably coming soon.
reply
I'd be happy if they can run GLM 5 like that. It's amazing at coding.
reply
Why do you assume this?

I can produce total gibberish even faster; that doesn't mean I'd produce Einstein-level thought if I slowed down.

reply
Better models already exist; this just proves you can dramatically increase inference speed / reduce inference cost.

It isn't about model capability; it's about inference hardware. Same smarts, faster.

reply
Not what he said.
reply
I think it might be pretty good for translation, especially when fed small chunks of the content at a time so it doesn't lose track on longer texts.
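A minimal sketch of what that chunking loop could look like, assuming a hypothetical translate_chunk() standing in for whatever fast endpoint this hardware exposes; the paragraph-based splitting and the 2000-character limit are my own assumptions:

```python
# Minimal sketch of chunked translation. translate_chunk is a hypothetical
# placeholder for whatever fast model/API you actually call.

def translate_chunk(text: str, target_lang: str) -> str:
    # Placeholder: call your model or API of choice here.
    raise NotImplementedError

def translate_document(text: str, target_lang: str, max_chars: int = 2000) -> str:
    """Split on paragraph boundaries so each request stays small enough
    that the model doesn't lose track of long inputs."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for p in paragraphs:
        # Start a new chunk once the current one would exceed the limit.
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    # Translate each chunk independently and stitch the results back together.
    return "\n\n".join(translate_chunk(c, target_lang) for c in chunks)
```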
reply