This seems extremely inefficient if the model is distributed, considering the data transfer between model layers. I found a project called Petals that claims up to 4 tok/s for a 180B model, although its repository hasn't been updated in two years.
It would work for prompt processing though, and it could work for diffusion LLMs as well.
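To put rough numbers on why token generation suffers while prompt processing doesn't, here's a back-of-envelope sketch. Everything in it is an assumption for illustration: the function names, the hop count, the round-trip time, the link speed, and the compute times are made up, not measurements from Petals or any real deployment.

```python
def decode_tokens_per_second(num_hops, rtt_s, compute_s_per_token):
    """Token-by-token generation: every new token must cross every hop in sequence."""
    return 1.0 / (num_hops * rtt_s + compute_s_per_token)


def prefill_tokens_per_second(num_hops, rtt_s, prompt_tokens,
                              activation_bytes_per_token,
                              bandwidth_bytes_per_s, prefill_compute_s):
    """Prompt processing: the whole prompt crosses each hop once as a single batch."""
    transfer_s = prompt_tokens * activation_bytes_per_token / bandwidth_bytes_per_s
    total_s = num_hops * (rtt_s + transfer_s) + prefill_compute_s
    return prompt_tokens / total_s


# Made-up illustrative numbers, not measurements.
decode_tps = decode_tokens_per_second(num_hops=8, rtt_s=0.05, compute_s_per_token=0.1)
prefill_tps = prefill_tokens_per_second(
    num_hops=8, rtt_s=0.05, prompt_tokens=1000,
    activation_bytes_per_token=8192 * 2,   # assumed hidden size 8192 at 2 bytes per value
    bandwidth_bytes_per_s=12.5e6,          # assumed 100 Mbit/s link
    prefill_compute_s=5.0,
)
print(f"decode:  {decode_tps:.1f} tok/s")
print(f"prefill: {prefill_tps:.1f} tok/s")
```

With these assumed numbers, decoding is capped at about 2 tok/s because the per-hop latency is paid for every single token, while prefill pays it once per prompt and comes out much faster per token.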
Open-source/open-weight models will get better and better, and eventually we will have mythos-level models running on smartphone/eyeglass hardware.
It is stupidly tedious currently to match supply with demand, though: a MacBook with 16 GB of RAM doesn't truly have 16 GB available, and on top of that you have to match models and all of their settings (KV cache, context limit, temperature, etc.) to demand.
Would appreciate any help, because we need AI inference by the people, for the people.
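For the matching problem above, a minimal sketch of the kind of check involved: does a given model plus its KV cache fit in the memory a machine actually has free? The KV cache size uses the usual estimate for a standard transformer (keys and values per layer, per KV head, per position). The function names and the model shape below are hypothetical, and the sketch ignores activations and runtime overhead, so treat it as a lower bound on memory needed.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    # Keys and values (the factor of 2) are stored per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def weight_bytes(n_params, bits_per_weight):
    # Quantized weights: bits per weight converted to bytes.
    return n_params * bits_per_weight // 8


def fits(available_bytes, n_params, bits_per_weight,
         n_layers, n_kv_heads, head_dim, context_len):
    needed = weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, context_len)
    return needed <= available_bytes


# Hypothetical 8B-parameter model at 4-bit, on a "16 GB" laptop
# where only about 10 GB is assumed to be actually free.
available = 10 * 10**9
model = dict(n_params=8_000_000_000, bits_per_weight=4,
             n_layers=32, n_kv_heads=8, head_dim=128)

print(fits(available, context_len=8192, **model))
print(fits(available, context_len=131072, **model))
```

The same machine and the same model can go from fitting to not fitting purely because of the context limit, which is part of why matching by nominal RAM alone doesn't work.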
Speculating Experts Accelerates Inference for Mixture-of-Experts: https://arxiv.org/abs/2603.19289
There is a middle way; the policy space also includes government regulating both access and monopoly.
I’m opposed to monopolies of this tech, but I hope the risks of giving everyone jailbroken AGI/ASI are clear.
As a toy example you could imagine a Universal Basic AI where government subcontracts to (n_quorum) labs, everyone gets a token budget, but operating the APIs comes with the safety controls.
If everyone does get to run their own jailbroken AGI, then the only stable societal norm I see is A LOT of surveillance to make sure nobody is building CBRNE threats. This doesn’t seem like a clear win from a civil-liberties perspective, though I could see the argument.
I think it’s a great project but the communication isn’t clear to me.
I'm not sure exactly why you would buy through them vs rolling your own if you could afford the equivalent hardware.
I'm a firm supporter of local inference though, so good on them for doing something.