The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
guess we'll be paying $200/month for a while
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside, it currently costs around 40,000-70,000 EUR to run this with an FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
Also, I'm not sure where you are getting the under-2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean an accuracy loss of about 3%. Not noticeable, but more chance for mistakes.
* FP2 ... which is what most people are able to run at home for a "reasonable" price. You're looking at over 17% loss in accuracy.
At that point you're running at less than claude-sonnet-4.6, as the issues compound with accuracy losses. And "reasonably priced" is still in the ~$5000 range (192GB + 32GB GPU for active/KV cache system).
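To make the sizing concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone at each precision. It assumes a 750B-parameter model (the figure mentioned later in this thread) and simply multiplies parameters by bits per weight; the function name is made up for illustration. Real quantized formats carry extra scale metadata, and the KV cache comes on top, so treat these as lower bounds.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumption: 750B parameters, taken from the "750b model" mentioned in the thread.
PARAMS_B = 750

footprints = {
    name: weight_memory_gb(PARAMS_B, bits)
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4), ("FP2", 2)]
}

for name, gb in footprints.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

Under these assumptions only the 2-bit variant lands under 192GB, which is consistent with the kind of system described above.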
For that price you could pay for a Codex / Claude Pro subscription for the next 4+ years and get better models by default, let alone compared with an FP2 GLM 5.2 version. And you're looking at < 10 fps. A Mac Studio with 512GB will net you 18 to 20+ fps with FP4, but ... I mean, those used to be $10,000.
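The break-even arithmetic behind that comparison can be sketched as below. The $5000 hardware figure is from this comment and $200/month is the price mentioned earlier in the thread; the $100/month tier is a hypothetical value chosen only because it roughly reproduces the "4+ years" claim. Electricity, resale value, and price changes are ignored.

```python
def breakeven_months(hardware_cost: float, monthly_subscription: float) -> float:
    """Months of subscription you could buy for the price of the hardware."""
    return hardware_cost / monthly_subscription


HARDWARE_COST = 5000  # from the comment above, in USD

# 100 is a hypothetical tier; 200 is the price quoted earlier in the thread.
for monthly in (100, 200):
    months = breakeven_months(HARDWARE_COST, monthly)
    print(f"${monthly}/month: {months:.0f} months (~{months / 12:.1f} years)")
```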
Unfortunately the local hardware cost is a major issue for running large models like that.
Edit: It's funny that whenever the issue of cost comes up, and what you have to give up versus the subscription services, there are always people who downvote in bad faith.
(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)
And there is a lot of drama in those discussions. GLM 5.2 is a great model for corporations to run, but people only want to hear about running a 35B/27B or maybe a 120B model. And in that market, subscription services are simply way better value for money (even taking the privacy issues into account).
Everybody wants GPT 5.5/Opus 4.8 Max levels on a budget that simply is not realistic. And GLM fits at that Opus 4.8 medium/low level.
But then people do not want to be told that running a 750B model in Q2 or Q1 is just going to destroy the model's accuracy. And that is still going to cost them 5k+ for that reduced model.
The whole local LLM landscape, from a consumer point of view, is just filled with odd people. lol.
Corporations really benefit from those models, because spending $90k on a server is a deductible expense. And they are billed at token prices anyway by all the major providers. So it's an even faster ROI on that hardware.
I am surprised that nobody has figured out how to make a business of selling leftover capacity from corporate LLM installations, because there is easily 12h+ a day just wasted (unless it's a large corp that has people in all timezones).
why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?
I have no intuition yet how this works under the hood.
Waiting for the hooman (or tool calls) won't help either, of course.
In contrast, prompt prefill is more easily compute-bound, so there are interesting trade-offs for latency of decode vs prefill when the LLM utilization is high.
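A minimal roofline-style sketch of that trade-off, under loud assumptions: decode re-reads all active weights for every token (so a single stream is limited by memory bandwidth), a batch shares that read, and compute costs roughly 2 FLOPs per active parameter per token. The function names and all hardware numbers are illustrative placeholders, not measurements. KV cache traffic is ignored, and for MoE models different sequences may hit different experts, so real batching gains are smaller.

```python
def decode_bound_tokens_per_s(bandwidth_gb_s: float, active_params_b: float,
                              bytes_per_weight: float) -> float:
    """Single-stream decode limit: each token streams the active weights once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token


def compute_bound_tokens_per_s(tflops: float, active_params_b: float) -> float:
    """Arithmetic limit: assumes ~2 FLOPs per active parameter per token."""
    return tflops * 1e12 / (2 * active_params_b * 1e9)


def batched_throughput(batch: int, bandwidth_gb_s: float, tflops: float,
                       active_params_b: float, bytes_per_weight: float) -> float:
    """Aggregate tokens/s: scales with batch size until compute becomes the limit."""
    memory_limit = batch * decode_bound_tokens_per_s(
        bandwidth_gb_s, active_params_b, bytes_per_weight)
    compute_limit = compute_bound_tokens_per_s(tflops, active_params_b)
    return min(memory_limit, compute_limit)


# Hypothetical machine: 1000 GB/s, 100 TFLOPS, 10B active params at 1 byte each.
for batch in (1, 8, 64):
    rate = batched_throughput(batch, 1000, 100, 10, 1)
    print(f"batch={batch}: ~{rate:.0f} tokens/s aggregate")
```

In this toy model a single user leaves most of the compute idle, which is why serving many requests at once is so much cheaper per token. Prefill processes many prompt tokens per weight read, so it behaves like the large-batch case.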
I think your answer was perfect. Not sure why you are being downvoted.