Is there a secure way to use GLM without spending tens of thousands of dollars on local hardware? I “only” have a 128 GiB inference machine, and I don't really trust Anthropic not to steal my IP over time.

I see no reason to trust Z.ai more than other vendors.

reply
Ollama Cloud has a $20-a-month subscription. They say they retain no data, and billing is by GPU time rather than by token.
reply
Kind of. You need at least 256 GB of RAM and 24-40 GB of VRAM to run the 2-bit quantization. Because it's an MoE, you only need the active experts to fit in VRAM to get a significant improvement over a pure CPU setup. At 2 bits, though, expect significant quality loss.
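
For a rough sense of where figures like these come from, here is a back-of-the-envelope sketch of weight size versus quantization width. It assumes the 768B total parameter count mentioned elsewhere in the thread and counts the weights only: KV cache, activations, and quantization scales are ignored, and real "2-bit" formats typically use somewhat more than 2 bits per weight, so treat the results as lower bounds.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 768B total parameters, the figure cited elsewhere in the thread.
N_PARAMS = 768e9

for bits in (2, 4, 8):
    size = weight_footprint_gb(N_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size:.0f} GB")
```

Under these assumptions the 2-bit weights alone come to roughly 192 GB, which is in line with needing at least 256 GB of system memory once overhead is added.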
reply
2 bits is a joke for serious work. You'd probably be better off with Qwen3.6 under 30G.

But there are EU-only providers for GLM5.2, for example tensorx. Depending on your definition of "secure", that may be acceptable.

reply
> 2 bits is a joke for serious work.

I have not tried it, but I will take your word for it. I don't think Qwen3.6 cuts it for large-scale coding work. Reading issues and reading code, sure, but when it has to bite into large issues it consistently goes off track.

Depending on budget it may also be affordable to spin up servers to run it on demand.

reply
You'd need to multiply that $10k by 8 minimum.
reply
Why? 4 x DGX Sparks should be enough. That's way less than $80k.
reply
From a quick Google search, a DGX Spark seems to decode Llama 3.1 70B (FP8) at about 2 tokens per second. I would expect the performance of a 768B-parameter model spread across 4 of them to be significantly lower, even though it's a mixture of experts.

For real work, anything below 60 tokens per second is essentially unusable. And that's not taking prompt prefill into account: Llama 3.1 70B on a DGX Spark prefills at about 800 tps, and at that speed filling a 512k context takes about 11 minutes.
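
The arithmetic behind those numbers, as a small sketch. The throughput figures are the ones quoted in this comment; reading "512k" as 512,000 tokens and the 1,000-token reply length are assumptions made only for illustration.

```python
def prefill_seconds(context_tokens: int, prefill_tps: float) -> float:
    """Time to process the prompt before the first output token appears."""
    return context_tokens / prefill_tps


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Time to generate the reply once the prompt has been processed."""
    return output_tokens / decode_tps


# Figures quoted in the comment; 512k is taken to mean 512,000 tokens.
CONTEXT_TOKENS = 512_000
PREFILL_TPS = 800.0

# Hypothetical reply length, chosen only for illustration.
REPLY_TOKENS = 1_000

minutes = prefill_seconds(CONTEXT_TOKENS, PREFILL_TPS) / 60
print(f"prefill of {CONTEXT_TOKENS} tokens: {minutes:.1f} min")

for tps in (2.0, 60.0):
    secs = decode_seconds(REPLY_TOKENS, tps)
    print(f"decode of {REPLY_TOKENS} tokens at {tps:g} tok/s: {secs:.0f} s")
```

Prefill works out to 640 seconds, or just under 11 minutes. Under the assumed reply length, a 1,000-token answer takes 500 seconds at 2 tokens per second versus about 17 seconds at 60.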

reply
deleted
reply
Neither does ChatGPT. And is the harness lock-in such a massive problem that you would pay 20x more?
reply