They postponed that change. Here is the email they sent out:

> In May, we sent you an email announcing that starting today, the Claude Agent SDK, claude -p, and third-party apps built on the Agent SDK would stop drawing from subscription rate limits and move to a dedicated monthly credit. We're writing to let you know that we’re not making this change today. We’re working to update the plan to better support how users build with Claude subscriptions.

> What this means for you

> Nothing changes for now. Agent SDK, claude -p, and third-party app usage continues to work with your subscription exactly as it did before today, and there's no credit to claim. Your subscription limits are unchanged. When we have an update, we'll share it with advance notice before it takes effect

reply
Something I haven't been able to figure out: how are you supposed to actually get an API key that uses quota from your subscription? The terms of service still forbid using OAuth authentication, and the API keys from the console indicate that you need to pre-load your account with funds when you try to use them.
reply
Z.ai does not lock you in to any harness.
reply
Is there a secure way to use GLM without spending $10K or more on local hardware? I “only” have a 128 GiB inference machine, and I don't really trust Anthropic not to steal my IP over time.

I see no reason to trust Z.ai more than other vendors.

reply
Ollama Cloud has a $20 a month subscription. They say they retain no information. And rather than token-based billing, it's GPU-time billing.
reply
Kind of. You need at least 256 GB of RAM and 24-40 GB of VRAM to run the 2-bit quantization. Because it's a MoE, you just need the experts to fit in VRAM to get a significant improvement over a pure CPU setup. At 2 bits, though, expect significant quality loss.
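
As a rough sanity check on the 256 GB figure, here is a back-of-the-envelope sketch. It assumes the 768B parameter count quoted elsewhere in this thread and counts weights only, so KV cache, activations, and quantization metadata are ignored. Real "2-bit" formats usually average somewhat more than 2 bits per weight, so treat these numbers as lower bounds. The function name is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    # params_billion * 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


# Assumption: 768B parameters, the figure mentioned in this thread.
PARAMS_B = 768

for bits in (2, 4, 8, 16):
    print(f"{bits:>2}-bit weights: ~{weight_gb(PARAMS_B, bits):.0f} GB")
```

Under those assumptions the 2-bit weights come to roughly 192 GB, which would fit in 256 GB of system RAM with some headroom, while FP8 would need about 768 GB for the weights alone.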
reply
2bits is a joke for serious work. You'd probably be better off with Qwen3.6 under 30 GB.

But there are EU-only providers for GLM 5.2, for example tensorx. Depending on your definition of "secure", that may be acceptable.

reply
> 2bits is a joke for serious work.

I have not tried it, but I will take your word for it. I don't think Qwen3.6 cuts it for large-scale coding work. Reading issues and reading code, sure, but biting into large issues, no: it consistently goes off track.

Depending on budget, it may also be affordable to spin up servers to run it on demand.

reply
You'd need to multiply that $10k by 8 minimum.
reply
Why? 4 x DGX Sparks should be enough. That's way less than $80k.
reply
From a quick Google search, a DGX Spark seems to decode Llama 3.1 70B (FP8) at 2 tokens per second. I would expect performance on a 768B-parameter model spread across 4 of them to be significantly lower, even though it's a mixture of experts.

For real work, anything below 60 tokens per second is essentially unusable. And that's not taking prompt processing into account: Llama 3.1 70B on a DGX Spark processes prompts at about 800 tps, and at that speed filling a 512k context takes about 11 minutes.
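
The arithmetic behind those numbers can be sketched as follows. The throughput figures are the ones quoted in this comment (not measurements of mine), and the 1,000-token reply length is a hypothetical value picked purely for illustration.

```python
def seconds_to_process(tokens: int, tokens_per_second: float) -> float:
    """Time to push `tokens` through a stage running at a fixed rate."""
    return tokens / tokens_per_second


# Figures quoted in the comment above.
PREFILL_TPS = 800         # prompt processing speed
DECODE_TPS = 2            # generation speed
USABLE_TPS = 60           # the "usable" threshold from the comment
CONTEXT_TOKENS = 512_000  # "512k" read as 512,000 tokens

# Hypothetical reply length, for illustration only.
REPLY_TOKENS = 1_000

prefill_minutes = seconds_to_process(CONTEXT_TOKENS, PREFILL_TPS) / 60
slow_reply_minutes = seconds_to_process(REPLY_TOKENS, DECODE_TPS) / 60
fast_reply_seconds = seconds_to_process(REPLY_TOKENS, USABLE_TPS)

print(f"Prefill of full context: {prefill_minutes:.1f} min")
print(f"1,000-token reply at 2 tps: {slow_reply_minutes:.1f} min")
print(f"1,000-token reply at 60 tps: {fast_reply_seconds:.1f} s")
```

So the "like 11 minutes" estimate checks out (640 seconds), and at 2 tps even a modest reply takes minutes rather than seconds.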

reply
Neither does ChatGPT. And is the harness lock-in such a massive problem that you would pay 20x more?
reply
They reverted this decision. "claude -p [prompt]" works fine with your subscription.
reply
I'm using synthetic.new and Neuralwatt with pi, and it's good and also cheap.
reply
I have had a bad experience with Neuralwatt's GLM 5.2. It seems like they may be using a quantized version of the model.
reply
Hi, I'm the CTO of Neuralwatt and would love to hear your feedback on what your experience was. Feel free to email me at scott@neuralwatt.com. Also, for GLM 5.2 we run the FP8 quantization at 1M context, which is a common deployment target.
reply
Hi Scott! I was just considering signing up; NW looks great (FP8 GLM 5.2 is good!). Standard cached-token pricing for GLM 5.2 is pretty high. I'm wondering whether the KV cache for that model actually is that expensive to serve on average, or whether Neuralwatt's energy pricing for long-running GLM 5.2 agents is especially competitive. The live energy stats don't break down by token type; I would love to see that. And 2/3 of the examples given in docs/energy-methodology are models you don't even host anymore. Uncertainty and selective stats put people off signing up, because they tend to assume the worst. Oh, and MiMo or DS4 please :)
reply
Thanks for the feedback! Our primary focus is charging by energy; for token pricing we really just try to be close to the market. That being said, I'll take a look at our token pricing to see if we need an update there: https://portal.neuralwatt.com/energy-pricing Generally our users get a much lower cost on energy pricing than on token pricing, though. On a typical request with a high prefix cache hit rate, the cached input cost is very small and the output energy cost is higher.

We definitely don't have any intention to obfuscate. In fact, we try to provide more data than any other provider out there about both individual requests and fleet behavior. Since we tend to focus directly on our energy pricing and on optimizing that, the issue is likely where the ROI lies on energy optimization versus token optimization (the two are totally correlated, but we have other levers to reduce energy while keeping token counts the same).

reply
I had a good experience with Neuralwatt in heavy testing on a real project over the last few days. Price/performance for API pricing was great. When using it with pi, I was a little confused about whether and how it supports different reasoning levels.
reply
I think they rolled that back
reply
They canned the move to make -p commands API-billable.
reply