You can batch requests when running locally too, as long as the model's KV-cache footprint is small enough to hold several sequences in memory at once. That targets the same resource efficiency the big providers rely on: decode is mostly limited by streaming the weights from memory, so extra sequences in a batch add compute throughput almost "for free", even on very limited hardware.
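To put rough numbers on that, here is a back-of-the-envelope sketch of a roofline-style estimate for one decode step. It is only a model under stated assumptions, not a benchmark: the `Hardware` and `Model` classes, every figure in them, and the simplification that a step costs the larger of its memory time and compute time are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    mem_bandwidth_bytes_per_s: float  # assumed sustained memory bandwidth
    compute_flops_per_s: float        # assumed sustained compute throughput


@dataclass
class Model:
    weight_bytes: float        # bytes of weights streamed on every decode step
    flops_per_token: float     # roughly 2 * parameter count
    kv_bytes_per_token: float  # KV-cache bytes held per token of context


def decode_step_seconds(hw: Hardware, model: Model, batch_size: int, context_len: int) -> float:
    """Estimate the time for one decode step that emits one token per sequence."""
    # Weights are read once per step no matter how many sequences share it,
    # while each sequence has to read its own KV-cache.
    kv_bytes = batch_size * context_len * model.kv_bytes_per_token
    memory_time = (model.weight_bytes + kv_bytes) / hw.mem_bandwidth_bytes_per_s
    compute_time = batch_size * model.flops_per_token / hw.compute_flops_per_s
    # Simplification: the step is bound by whichever resource is slower.
    return max(memory_time, compute_time)


def tokens_per_second(hw: Hardware, model: Model, batch_size: int, context_len: int) -> float:
    """Aggregate tokens per second across the whole batch."""
    return batch_size / decode_step_seconds(hw, model, batch_size, context_len)


# Hypothetical small machine and small model; none of these are measured values.
hw = Hardware(mem_bandwidth_bytes_per_s=100e9, compute_flops_per_s=10e12)
small_kv = Model(weight_bytes=4e9, flops_per_token=8e9, kv_bytes_per_token=50e3)
large_kv = Model(weight_bytes=4e9, flops_per_token=8e9, kv_bytes_per_token=5e6)

if __name__ == "__main__":
    for name, model in (("small KV-cache", small_kv), ("large KV-cache", large_kv)):
        for batch in (1, 2, 4, 8):
            rate = tokens_per_second(hw, model, batch, context_len=2048)
            print(f"{name}: batch={batch} -> {rate:.1f} tokens/s")
```

In this toy model the small-KV case scales almost linearly with batch size because the weight reads dominate, while the large-KV case barely improves, which is why the KV-cache requirement is the catch.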

by duskdozer, 3 hours ago

Maybe people would target their use more appropriately, then.

by amelius, 3 hours ago

It's even more awful if the compute capital is owned by only a handful of players.