Seconding this. From the recent Alibaba Qwen conference: the all-in-one box (a DC in a box; I think it was called Apsara, 0.6 x 0.6 x 1.5 m) is plug and play, has 1.5 TB of GPU RAM, can run in a fully air-gapped environment, and runs any open models. All of that is roughly $300k one time. The box can do non-LLM tasks as well. Performance (throughput) is around 20k t/s. Delivery time is around 2 months. For any medium-sized company it's perhaps cheaper to just buy it once than to spend 1.5k on cloud per user.
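To put rough numbers on the buy-vs-cloud comparison, here is a back-of-the-envelope sketch. It assumes the 1.5k is a per-user cost over some billing period (the period isn't stated above, so `periods` is a made-up knob), and it ignores power, staffing, and hardware depreciation. The function name is purely illustrative.

```python
import math


def break_even_users(box_cost: float, cloud_cost_per_user: float, periods: int = 1) -> int:
    """Smallest user count at which cumulative cloud spend reaches the box price.

    Assumption: cloud_cost_per_user is charged once per billing period and the
    box has no running costs, which understates the real on-prem bill.
    """
    if box_cost < 0 or cloud_cost_per_user <= 0 or periods <= 0:
        raise ValueError("costs and periods must be positive")
    return math.ceil(box_cost / (cloud_cost_per_user * periods))


if __name__ == "__main__":
    # Figures quoted in the comment: $300k box, 1.5k cloud spend per user.
    print(break_even_users(300_000, 1_500))             # one billing period
    print(break_even_users(300_000, 1_500, periods=3))  # three billing periods
```

Under those assumptions the box pays for itself at 200 users over a single billing period, or 67 users over three periods.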
Where can I find more information on this? A web search didn’t reveal much for me.
Decent vs best-money-can-buy. Further, a self-hosted LLM will be much slower.
I think we're all past the "best-money-can-buy" stage. The most expensive models are an order of magnitude more expensive than the middle-ground ones, so you need to be selective about what you run where.

And with a bit of careful routing, there isn't a lot stopping you from sending the hard stuff to a cloud model and the average stuff to an on-prem model.
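A minimal sketch of what that routing could look like, assuming a crude heuristic (prompt length plus a few keywords) stands in for "hard". The keyword list, threshold, and target names are all hypothetical; a real setup would more likely use a classifier or a confidence check on the local model's answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    target: str  # "cloud" or "on_prem"
    reason: str


# Hypothetical markers of a "hard" request; tune these for your own workload.
HARD_HINTS = ("prove", "refactor", "architecture", "multi-step")


def route(prompt: str, max_on_prem_chars: int = 2000) -> Route:
    """Send long or keyword-flagged prompts to the cloud, the rest on-prem."""
    if len(prompt) > max_on_prem_chars:
        return Route("cloud", "prompt longer than on-prem limit")
    lowered = prompt.lower()
    for hint in HARD_HINTS:
        if hint in lowered:
            return Route("cloud", f"matched hard-task hint: {hint}")
    return Route("on_prem", "default")


if __name__ == "__main__":
    for p in ("Summarize this email", "Refactor the billing module"):
        print(p, "->", route(p))
```

The point is only that the decision can sit in a few lines in front of both backends, so the expensive model sees just the traffic that needs it.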

Only people who pay per use optimize for this. Most heavy users have their usage covered by an employer.
I have my use covered by my employer but we also have budgets and limits.
I'd think for most companies the pace of change is too high at the moment. Give it a few years and a bit of a plateau in frontier model improvements, and I can't see how many of these companies avoid imploding under the weight of competition on inference prices.