Actually, yes. For example, I run local models for document ingestion, summaries, etc. The local models are fine for that, and there is no need for me to pay for tokens. Performance is adequate for that purpose as well. There are many other cases where I run at scale, the timing is flexible so things can move slower, and I'd rather keep it all in house. I'm not even getting into areas where data cannot leave the premises for legal reasons. Right now I'm mostly limited by GPUs. But if that world of local models on Apple silicon is so "good", there is room to expand it to other fruits...
reply
> These models are dumber and slower than API SoTA models and will always be.

Sure, but you're paying per-token costs on the SoTA models that are roughly an order of magnitude higher than third-party inference on the locally available models. So once you account for per-token cost, the math skews the other way.
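
To make the arithmetic concrete, here is a rough sketch of cost per solved task. Every number in it is a made-up placeholder, not a real price or benchmark: the only thing taken from the point above is the roughly 10x price gap. It also assumes failed attempts are detectable and simply retried, with each attempt succeeding independently.

```python
def cost_per_solved_task(price_per_mtok: float,
                         tokens_per_attempt: int,
                         success_rate: float) -> float:
    """Expected spend to get one successful result.

    Assumes independent attempts that are retried until one succeeds,
    so the expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = price_per_mtok * tokens_per_attempt / 1_000_000
    return cost_per_attempt / success_rate


# Hypothetical inputs: a 10x price gap and a noticeably weaker open model.
sota_cost = cost_per_solved_task(price_per_mtok=10.0,
                                 tokens_per_attempt=5_000,
                                 success_rate=0.9)
open_cost = cost_per_solved_task(price_per_mtok=1.0,
                                 tokens_per_attempt=5_000,
                                 success_rate=0.6)

print(f"SoTA API:   ${sota_cost:.4f} per solved task")
print(f"Open model: ${open_cost:.4f} per solved task")
print(f"Ratio:      {sota_cost / open_cost:.1f}x")
```

With these placeholder numbers the cheaper model still comes out ahead per solved task even though it fails more often. The picture flips only if its success rate drops by more than the price gap, or if failures are expensive to detect.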
