by horsawlarway 19 hours ago

by cortesoft 19 hours ago

They postponed that change. Here is the email they sent out:

> In May, we sent you an email announcing that starting today, the Claude Agent SDK, claude -p, and third-party apps built on the Agent SDK would stop drawing from subscription rate limits and move to a dedicated monthly credit. We're writing to let you know that we’re not making this change today. We’re working to update the plan to better support how users build with Claude subscriptions.

> What this means for you

> Nothing changes for now. Agent SDK, claude -p, and third-party app usage continues to work with your subscription exactly as it did before today, and there's no credit to claim. Your subscription limits are unchanged. When we have an update, we'll share it with advance notice before it takes effect.

by clhodapp 13 hours ago

Something I haven't been able to figure out: how are you supposed to actually get an API key that uses quota from your subscription? The terms of service still forbid using OAuth authentication, and the API keys from the console indicate that you need to pre-load your account with funds when you try to use them.

by throwawayffffas 19 hours ago

Z.ai does not lock you in to any harness.

by hedora 15 hours ago

Is there a secure way to use GLM without spending $10Ks on local HW? I “only” have a 128GiB inference machine, and don’t really trust Anthropic not to steal my IP over time.

I see no reason to trust Z.ai more than other vendors.

by dalenw 2 hours ago

Ollama Cloud has a $20 a month subscription. They say they retain 0 information. And rather than token based billing, it's GPU time billing.

by throwawayffffas 11 hours ago

Kind of. You need at least 256 GB of RAM and 24-40 GB of VRAM to run the 2-bit quantization. Because it's an MoE, you just need the active experts to fit in VRAM to get a significant improvement over a pure CPU setup. At 2 bits, though, expect significant quality loss.
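As a rough sanity check on those memory numbers, here is a minimal sketch of the weights-only footprint at different quantization widths. It assumes the 768B parameter count quoted elsewhere in this thread, and it ignores KV cache, activations, and per-block quantization overhead, so real requirements will be higher. The function name is made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 768B total parameters, the figure quoted elsewhere in this thread.
N_PARAMS = 768e9

for bits in (2, 4, 8):
    print(f"{bits}-bit: ~{weight_footprint_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions the 2-bit weights alone come to roughly 192 GB, which is in line with needing at least 256 GB of system memory once runtime overhead is added.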

by Roark66 6 hours ago

2bits is a joke for serious work. You'd probably be better off with Qwen3.6 at under 30 GB.

But there are EU-only providers for GLM 5.2, for example tensorx. Depending on your definition of "secure", that may be acceptable.

by throwawayffffas 5 hours ago

> 2bits is a joke for serious work.

I have not tried it, but I will take your word for it. I don't think Qwen3.6 cuts it for large-scale coding work. Reading issues and reading code, sure, but for biting into large issues, no: it consistently goes off track.

Depending on budget it may also be affordable to spin up servers to run it on demand.

by villish 12 hours ago

You'd need to multiply that $10k by 8 minimum.

by naasking 4 hours ago

Why? 4 x DGX sparks should be enough. That's way less than $80k.

by throwawayffffas 3 hours ago

From a quick Google search, a DGX Spark seems to decode Llama 3.1 70B (FP8) at 2 tokens per second. I would expect the performance on a 768B parameter model spread across 4 to be significantly lower, even though it's a mixture of experts.

For real work anything below 60 tokens per second is essentially unusable. That's not taking prompt filling into account: Llama 3.1 70B on a DGX Spark does it at about 800 tps, and at that speed prompt filling a 512k context takes about 11 minutes.
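The arithmetic behind those latency figures can be sketched as follows. The throughput numbers are the ones quoted above; the 1,000-token response length is an assumption added for illustration, and the function names are hypothetical.

```python
def prefill_minutes(context_tokens: int, prefill_tps: float) -> float:
    """Minutes needed to process a prompt of the given length."""
    return context_tokens / prefill_tps / 60


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Seconds needed to generate the given number of output tokens."""
    return output_tokens / decode_tps


# Figures quoted in the comment above: 512k context, ~800 tps prompt filling.
print(f"prefill: {prefill_minutes(512_000, 800):.1f} min")

# Assumption: a 1,000-token response, compared at 2 tps and at 60 tps.
for tps in (2, 60):
    print(f"decode at {tps} tps: {decode_seconds(1_000, tps):.0f} s")
```

At 800 tps, 512,000 tokens take 640 seconds, which rounds to the 11 minutes mentioned above.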

by orangeisthe 5 hours ago

Neither does ChatGPT. And is harness lock-in such a massive problem that you would pay 20x more?

by huksley 11 hours ago

They reverted this decision; "claude -p [prompt]" works fine with your subscription.

by sroerick 19 hours ago

I'm using synthetic.new and Neuralwatt with pi, and it's good and also cheap.

by computerex 19 hours ago

I have had a bad experience with Neuralwatt's GLM 5.2. It seems like they may be using a quantized version of the model.

by scottcha 17 hours ago

Hi, I'm the CTO of Neuralwatt and would love to hear your feedback on what your experience was. Feel free to email me at scott@neuralwatt.com. Also, for GLM 5.2 we run the FP8 quantization at 1M context, which is a common deployment target.

by versteegen 14 hours ago

Hi Scott! I was just considering signing up; NW looks great (fp8 GLM 5.2 is good!). Standard cached token pricing for GLM 5.2 is pretty high. I'm wondering whether the KV cache for that model actually is that expensive to serve on average, or whether Neuralwatt's energy pricing for long-running GLM 5.2 agents is especially competitive. The live energy stats don't break down by token type; I would love to see that. And 2/3 of the examples given in docs/energy-methodology are models you don't even host anymore. Uncertainty and selective stats put people off signing up, because they tend to assume the worst. Oh, and MiMo or DS4 please :)

by scottcha 3 hours ago

Thanks for the feedback! Our primary focus is charging by energy; for token pricing we really just try to be close to the market. That being said, I'll take a look at our token pricing to see if we need an update there: https://portal.neuralwatt.com/energy-pricing Generally, though, our users get a much lower cost on energy pricing than on token pricing. On a typical request with a high prefix cache hit rate, the input and cached costs are very small and the output energy cost is higher.

We definitely don't have any intention to obfuscate. In fact, we try to provide more data than any other provider out there about both individual requests and fleet behavior. Since we tend to focus directly on our energy pricing and on optimizing that, the issue is likely where the ROI lies: energy optimization versus token optimization (the two are totally correlated, but we have other levers to reduce energy while keeping token counts the same).

by johne20 14 hours ago

I had a good experience with Neuralwatt in heavy testing on a real project over the last few days. Price/performance at API pricing was great. When using it with pi, I was a little confused about whether and how it supports different reasoning levels.

by weird-eye-issue 19 hours ago

I think they rolled that back

by smcleod 19 hours ago

They canned the move to make -p commands API-billable.
