The reason you don't see more of this is that everyone does the math, realizes it's not a good deal, and then gives up on the idea.

There's a post at the top of /r/localllama about this exact math right now: https://www.reddit.com/r/LocalLLaMA/comments/1ubrcwj/tokenom...

TL;DR: Running GLM 5.2 is going to cost about $20K minimum, and that's going to be painfully slow compared to the cloud-hosted versions. Even in the estimates where the server is computing tokens 24/7, you can't break even for several years.

The only reason to run locally is if complete data privacy is your top concern. You pay a high premium for that.

reply
If you invest the minimum to run the model, obviously that's more expensive per token than investing the optimum to get the best price/performance tradeoff (which for GLM 5.2 is at least five times that figure).

If you can keep close-to-optimal hardware loaded 24/7 with multiple concurrent requests, and you have reasonably cheap power and AC, you would break even in a reasonable timespan. That won't happen unless you are self-hosting for a medium-sized company. I guess you could sell your spare capacity to get better utilization ... and we've reinvented hosted inference.
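To make the utilization point concrete, here is a rough break-even sketch. Every number passed in is a placeholder I made up for illustration (the cloud price, the power draw, the electricity rate, the throughput), so plug in your own; the only thing it is meant to show is how strongly the payback time depends on how busy the box is.

```python
def break_even_years(hardware_usd, tokens_per_s, utilization,
                     cloud_usd_per_mtok, power_kw, usd_per_kwh):
    """Years until owned hardware pays for itself versus buying the same tokens from a host.

    Simplifying assumptions (mine, not from the thread): the machine is powered
    on all year regardless of load, and cooling, maintenance, and depreciation
    are ignored.
    """
    hours_per_year = 24 * 365
    # Tokens actually produced per year at the given utilization (0.0 to 1.0).
    tokens_per_year = tokens_per_s * 3600 * hours_per_year * utilization
    # What those tokens would have cost from a hosted provider.
    cloud_cost_per_year = tokens_per_year / 1e6 * cloud_usd_per_mtok
    # Electricity for keeping the machine on all year.
    power_cost_per_year = power_kw * hours_per_year * usd_per_kwh
    yearly_savings = cloud_cost_per_year - power_cost_per_year
    if yearly_savings <= 0:
        return float("inf")  # never breaks even
    return hardware_usd / yearly_savings


# Hypothetical inputs: $100K box, 800 tokens/s aggregate, $2 per million
# tokens in the cloud, 3 kW draw, $0.15 per kWh.
fully_loaded = break_even_years(100_000, 800, 1.0, 2.0, 3.0, 0.15)
mostly_idle = break_even_years(100_000, 800, 0.1, 2.0, 3.0, 0.15)
idle = break_even_years(100_000, 800, 0.0, 2.0, 3.0, 0.15)
```

With the same hardware, dropping utilization stretches the payback time dramatically, which is why this only works out if you have a company's worth of traffic.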

reply
I mean sure, if you’re attempting to run the biggest possible models, it’s going to require a stupid amount of compute? I thought we all knew this?

The appeal to me is that we can run that, but we can also run smaller models on a laptop _and it’s functional!_ I can run DeepSeek v4 flash and a qwen 3.6 on my laptop! That’s crazy good.

reply
Conversely, all the cloud LLMs are being subsidized by their investors, in addition to benefiting from massive economies of scale.
reply
It is false to say that all cloud LLMs are subsidized. The open-weights models are hosted by numerous third-party providers on OpenRouter that operate as hosting businesses. They aren’t spending investor money to provide tokens for you at below-cost rates.
reply
Economies of scale are enough to explain the entire price difference. Running 8 concurrent requests at 100 tokens/s on $100k hardware is a lot cheaper per token than running one request at 20 tokens/s on $20k hardware.
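Spelling out the arithmetic, as a quick sketch: I'm assuming the 100 tokens/s figure is per request (so 800 tokens/s aggregate), and I'm only comparing hardware dollars per unit of throughput, ignoring power and everything else.

```python
def hardware_usd_per_token_per_s(price_usd, concurrent_requests, tokens_per_s_each):
    """Hardware dollars spent per 1 token/s of aggregate throughput."""
    aggregate_tokens_per_s = concurrent_requests * tokens_per_s_each
    return price_usd / aggregate_tokens_per_s


# Figures from the comment above; the per-request reading of 100 tokens/s is an assumption.
big_box = hardware_usd_per_token_per_s(100_000, 8, 100)  # $125 per token/s
small_box = hardware_usd_per_token_per_s(20_000, 1, 20)  # $1000 per token/s
advantage = small_box / big_box                          # 8.0
```

Under that reading, the bigger box costs 5x as much but delivers 40x the throughput, so each token/s of capacity costs one eighth as much.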
reply
There are plenty of providers of open models that offer very affordable rates. Generally, I recommend looking at OpenRouter since they track various metrics for the various providers.
reply
Open models hosted in the cloud???
reply
AWS Bedrock hosts Gemma 4 31B, and this is The Best Deal, hands down. Try it. Vertex also has the Gemma 4 MoE version. Not "lobotomised" by quants. There are also GLM (latest) and Qwen / DS (but these two are not the latest versions).
reply