And Opus is absolutely terrible at guessing how many tokens it's used. Having that as a number that the model can access itself would be a real boon.
So, it has been convenient to not have hard stops / allow for extra but I still try to /clear at an actual 25% of the 1M anyhow.
This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.
Claude Opus at 150K context starts getting dumber and dumber.
Claude Opus at 200K+ is mentally retarded. Abandon hope and start wrapping up the session.
If you want quality you still have to compact or start new contextes often.
For around a month the limit seemed to be a little over 60k! I was despondent!!
What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure, that they are trying to move from one context window to another or have some kv cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like some other kind of hint about kv caching, maybe it not porting well between different shaped systems.
More maliciously minded, this artificial limit also gives them an artificial way to dial in system load. Just not delivering the context window the model has reduces the work of what they have to host?
But to the question: yes compaction is absolutely required. The ai can't even speak it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could find a way to build this into the harness, so no, it's a limitation of our tooling that our tooling doesn't work around the stated context window being (effectively) a lie.
I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.
There's a thread https://news.ycombinator.com/item?id=47678279 , and I have more extensive history / comments on what I've seen there.
The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).
During off peak hour a simple 3 line CSS change took over 50 minutes and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files
Starting an hour or two ago GLM's API endpoint is failing 7/8 times for me, my editor is retrying every request with backoff over a dozen times before it succeeds and wildly simple changes are taking over 30 minutes per step.
But it's all casual side projects.
Edit: I often to /compact at around 100 000 token or switch to a new session. Maybe that is why.
For the price this is a pretty damn impressive model.
Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1
$1.40 in $4.40 out $0.26 cached
/ 1M tokens
That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things about people who used z.ai directly.
Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?