If you invest the minimum to run the model, obviously that's more expensive per token than investing the optimum to get the best price/performance tradeoff (which for GLM 5.2 is at least five times that figure).

If you can bring enough load to run the model on close-to-optimal hardware 24/7 with multiple concurrent requests, and you have reasonably cheap power and AC, you would break even in a reasonable timespan. But that won't happen unless you are self-hosting for a medium-sized company. I guess you could sell your spare capacity to get better utilization ... and we've reinvented hosted inference.

reply
I mean, sure, if you’re attempting to run the biggest possible models, it’s going to require a stupid amount of compute. I thought we all knew this?

The appeal to me is that we can run that, but we can also run smaller models on a laptop _and it’s functional!_ I can run DeepSeek v4 flash and a qwen 3.6 on my laptop! That’s crazy good.

reply
Conversely, all the cloud LLMs are being subsidized by their investors, in addition to benefiting from massive economies of scale.
reply
It is false to say that all cloud LLMs are subsidized. The open-weights models are hosted by numerous third-party providers on OpenRouter that operate as hosting businesses. They aren’t spending investor money to provide tokens for you at below-cost rates.
reply
Economies of scale are enough to explain the entire price difference. Running 8 concurrent requests at 100 tokens/s each on $100k hardware is a lot cheaper per token than running one request at 20 tokens/s on $20k hardware: five times the hardware cost buys 40 times the throughput.
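
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the 100 tokens/s figure is per request, counts only the hardware purchase price, and ignores power, cooling, and idle time. The function name is made up for illustration.

```python
def hardware_cost_per_throughput(hardware_cost_usd, concurrent_requests, tokens_per_second_each):
    """Dollars of hardware per (token/s) of aggregate throughput.

    Assumption: every concurrent request sustains the same token rate,
    and the hardware is kept fully busy.
    """
    aggregate_tokens_per_second = concurrent_requests * tokens_per_second_each
    return hardware_cost_usd / aggregate_tokens_per_second


# Figures taken from the comment above.
shared = hardware_cost_per_throughput(100_000, 8, 100)  # $100k box, 8 requests
solo = hardware_cost_per_throughput(20_000, 1, 20)      # $20k box, 1 request

print(f"shared: ${shared:.0f} of hardware per token/s")
print(f"solo:   ${solo:.0f} of hardware per token/s")
print(f"solo costs {solo / shared:.0f}x more hardware per unit of throughput")
```

On these numbers the big box needs $125 of hardware per token/s against $1000 for the small one, an 8x gap before any subsidy enters the picture. The gap shrinks if the big box sits idle most of the day, which is the utilization point made upthread.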
reply