I don't care about privacy, and I haven't had many problems with the reliability of AI companies. Spending a ridiculous amount of money on hardware that will be obsolete in a few years and won't be utilized at 100% in the meantime is not something many people would do, IMO. Privacy is good when it comes for free.

I would rather spend money on some pseudo-local inference (where a cloud company manages everything for me and I can just specify an open source model and pay for GPU usage).

reply
On-prem economics don't work because you can't batch requests — unless you are able to run 100 agents at the same time, all the time.
reply
> unless you are able to run 100 agents at the same time all the time

Except that newer "agent swarm" workflows do exactly that. Besides, batching requests comes with a sizeable increase in memory footprint, and memory is often the main bottleneck, especially with the long contexts typical of agent workflows. If you have plenty of agentic tasks that are not latency-critical and don't need the absolute best model, it makes plenty of sense to schedule them to run locally.
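To put a rough number on the memory point — a back-of-envelope sketch, with assumed model dimensions (a 70B-class model with grouped-query attention; the layer count, head count, and head dimension below are illustrative, not taken from any specific model card):

```python
# KV-cache memory per request, fp16, assumed 70B-class dimensions.
n_layers = 80        # assumed transformer layer count
n_kv_heads = 8       # assumed KV heads (grouped-query attention)
head_dim = 128       # assumed per-head dimension
bytes_per_elem = 2   # fp16

# K and V each store n_kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

context_len = 32_768  # a long agentic context
per_request_gib = kv_bytes_per_token * context_len / 2**30

print(f"{kv_bytes_per_token} bytes/token, {per_request_gib:.1f} GiB per request")
# With these numbers: 327680 bytes/token, 10.0 GiB per request —
# so batching 100 such requests needs ~1 TiB of KV cache alone,
# before model weights and activations.
```

Under these assumptions, batch size is capped by KV-cache memory long before compute saturates, which is why heavy batching and long contexts pull in opposite directions.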

reply