[-]

100k tokens is basically nothing these days. Claude Opus 4.6 with a 1M context window is just a different ball game.

by plandis 1 day ago

[-]

Claude Opus can use a 1M context window but I’ve found it to degrade significantly past 250k in practice.

by marcus_holmes 1 day ago

[-]

Seconded. I'm getting used to the changes that happen in the conversation now, and can work out when it's time for my little coding buddy to have a nap.

And Opus is absolutely terrible at guessing how many tokens it's used. Having that as a number that the model can access itself would be a real boon.
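A minimal sketch of how a harness could surface that number, assuming the provider reports per-request prompt and completion token counts (most chat APIs do, but the `ContextBudget` class and the format of the status line here are hypothetical, not any tool's actual API):

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Hypothetical helper: tracks context occupancy for one session."""
    window: int      # total context window in tokens, e.g. 1_000_000
    used: int = 0    # tokens currently occupying the context

    def update(self, prompt_tokens: int, completion_tokens: int) -> None:
        # Assumption: the prompt of each request already contains the whole
        # history, so the latest request's totals are the current occupancy.
        self.used = prompt_tokens + completion_tokens

    def status_line(self) -> str:
        # A line the harness could append to each user turn so the model
        # sees a real number instead of guessing.
        return (f"[context: {self.used:,} / {self.window:,} tokens used "
                f"({self.used / self.window:.0%})]")
```

Appending `status_line()` to each turn costs a few tokens per message, which seems a fair trade for not relying on the model's own estimate.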

by wild_egg 1 day ago

[-]

The Dumb Zone for Opus has always started at 80-100k tokens. The 1M token window just made the dumb zone bigger. Probably fine if the work isn't complicated but really I never want an Opus session to go much beyond 100k.

by braebo 1 day ago

[-]

The cost per message increases with context while quality decreases so it’s still generally good to practice strategic context engineering. Even with cross-repo changes on enterprise systems, it’s uncommon to need more than 100k (unless I’m using playwright mcp for testing).

by bredren 1 day ago

[-]

I had thought this, but my experience initially was that performance degradation began getting noticeable not long after crossing the old 250k barrier.

So, it has been convenient to not have hard stops / allow for extra but I still try to /clear at an actual 25% of the 1M anyhow.

This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.

by syntaxing 1 day ago

[-]

I’m genuinely surprised. I use copilot at work, which is capped at 128K regardless of model, and it’s a monorepo. Admittedly I know our code base really well, so I can quickly point it towards different things directly, but I don’t think I’ve needed compacting more than a handful of times in the past year, let alone 1M tokens.

by arcanemachiner 1 day ago

[-]

Personal opinions follow:

Claude Opus at 150K context starts getting dumber and dumber.

Claude Opus at 200K+ is mentally retarded. Abandon hope and start wrapping up the session.

by operatingthetan 1 day ago

[-]

The context windows of these Chinese open-source subscriptions (GLM, Minimax, Kimi) are too small, and I'm guessing it's because they are trying to keep them cheap to run. Fine for openclaw, not so much for coding.

by thawab 1 day ago

[-]

Don’t want to disappoint you, but above 200k Opus has the memory of a goldfish. You need to be below 150k to get good research and implementation.

by arcanemachiner 1 day ago

[-]

Oh nice, I just wrote pretty much the same comment above yours.

by epolanski 1 day ago

[-]

Quality degrades fast with context length for all models.

If you want quality you still have to compact or start new contexts often.

[-]

Is manual compaction absolutely mandatory?

by DeathArrow 1 day ago

[-]

When using GLM 5.1 in Open Code, compaction was done automatically.

by jauntywundrkind 1 day ago

[-]

I haven't screenshotted it, alas, but it goes from being a perfectly reasonable chatty LLM to suddenly spewing words and nonsense characters around this threshold, at least for me as a z.ai pro (mid tier) user.

For around a month the limit seemed to be a little over 60k! I was despondent!!

What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure, that they are trying to move from one context window to another or have some kv cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like some other kind of hint about kv caching, maybe it not porting well between different shaped systems.

More maliciously minded, this artificial limit also gives them an artificial way to dial in system load. Just not delivering the context window the model has reduces the work of what they have to host?

But to the question: yes, compaction is absolutely required. The AI can't even speak; it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could find a way to build this into the harness, so no; it's a limitation of our tooling that it doesn't work around the stated context window being (effectively) a lie.
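A rough sketch of what building this into the harness might look like, under loud assumptions: token counts are estimated at about 4 characters per token (a heuristic, not a real tokenizer), and `summarize` is a hypothetical callable that would ask the model to condense older messages. Neither is taken from OpenCode or any real tool.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Heuristic assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def maybe_compact(messages: list[str], limit: int, keep_recent: int,
                  summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace older messages with one summary once the estimate exceeds limit."""
    total = sum(estimate_tokens(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages  # still under the working threshold, nothing to do
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    # One summary message stands in for everything older than the tail.
    return [summarize(old)] + recent
```

Setting `limit` to the threshold where the model actually stays coherent (say 100k), rather than the advertised window, is the point: the harness compacts before the degradation starts.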

I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.

There's a thread https://news.ycombinator.com/item?id=47678279 , and I have more extensive history / comments on what I've seen there.

The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).

by calgoo 1 day ago

[-]

I have gone back to having it create a todo.md file and break the work into very small tasks. Then I just loop over each task with a clear context, and it works fine. A design.md or similar also helps, but most of the time I just have that all in a README.md file. I also found it suspicious that it started looping etc. at around 100k, almost to the token.
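A minimal sketch of that loop, assuming todo.md uses Markdown checkboxes (`- [ ]` / `- [x]`), which is my assumption and not stated above. `run_task` is a hypothetical callable that starts a fresh agent session for one task and returns its result.

```python
import re
from typing import Callable

# Matches "- [ ] task" or "* [x] task"; group 1 is the checkbox state.
TASK_RE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*\S)\s*$")


def pending_tasks(todo_markdown: str) -> list[str]:
    """Return the text of every unchecked task, in file order."""
    tasks = []
    for line in todo_markdown.splitlines():
        m = TASK_RE.match(line)
        if m and m.group(1) == " ":
            tasks.append(m.group(2))
    return tasks


def run_all(todo_markdown: str, run_task: Callable[[str], str]) -> dict[str, str]:
    # Each run_task call is assumed to begin with an empty context, so no
    # task inherits the token load of the previous one.
    return {task: run_task(task) for task in pending_tasks(todo_markdown)}
```

Because every task starts clean, shared background (the design.md or README.md mentioned above) would need to be passed in by `run_task` each time.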

by disiplus 1 day ago

[-]

Basically my experience as well. Sometimes it can break past 100k and be ok, but mostly it breaks down.

[-]

I am on the mid tier Coding plan to try it out for the sake of curiosity.

During off-peak hours a simple 3-line CSS change took over 50 minutes, and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files.

by harias 1 day ago

[-]

Off peak for China or US?

[-]

Off peak for China. Off peak times are only in one timezone

by InsideOutSanta 1 day ago

[-]

My impression is that different users get vastly different service, possibly based on location. I live in Western Europe, and it works perfectly for me. Never had a single timeout or noticeable quality degradation. My brother lives in East Asia, and it's unusable for him. Some days, it just literally does not work, no API calls are successful. Other days, it's slow or seems dumber than it should be.

[-]

It's now mid weekday in China timezone.

Starting an hour or two ago, GLM's API endpoint is failing 7/8 times for me. My editor is retrying every request with backoff over a dozen times before it succeeds, and wildly simple changes are taking over 30 minutes per step.
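For reference, retrying with backoff usually looks something like the sketch below. This is a generic exponential backoff with full jitter, not the editor's actual implementation; the function name, defaults, and the injectable `sleep` parameter are all illustrative assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T], max_attempts: int = 12,
                       base_delay: float = 1.0, max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call` until it succeeds, waiting longer after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, cap] so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

If failures were independent at a 7/8 rate, the expected number of attempts per successful request would be 8, so with delays that grow this way a single request can easily spend minutes waiting.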

by csomar 1 day ago

[-]

Their distribution operation is very bad right now. The model is pretty decent when it works but they have lots of issues serving the people. That being said, I have had the same problems with Gemini (even worse in the last two weeks) and Claude. So it seems to be the norm in the industry.

by satvikpendem 1 day ago

[-]

Every model seems that way, going back to even GPT 3 and 4: the company comes out with a very impressive model that then regresses over a few months as the company tries to rein in inference costs through quantization and other methods.

by wolttam 1 day ago

[-]

This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent

by Mashimo 1 day ago

[-]

I'm also on the lite plan and have been using 5.1 for a few days now. It works fine for me.

But it's all casual side projects.

Edit: I often do /compact at around 100,000 tokens or switch to a new session. Maybe that is why.

by LaurensBER 1 day ago

[-]

I'm on their lite plan as well and I've been using it for my OpenClaw. It had some issues but it also one-shotted a very impressive dashboard for my Twitter bookmarks.

For the price this is a pretty damn impressive model.

by cmrdporcupine 1 day ago

[-]

Is there any advantage to their fixed payment plans at all vs just using this model via 3rd party providers via openrouter, given how relatively cheap they tend to be on a per-token basis?

Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1

$1.40 in, $4.40 out, $0.26 cached, per 1M tokens

That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
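To make the per-token pricing concrete, here is a small sketch that turns the prices quoted above into a per-request cost. It assumes the "cached" price applies to input tokens served from the provider's prompt cache; the function name and the token counts in the example are made up for illustration.

```python
# Prices quoted above, in USD per 1M tokens.
PRICE_IN = 1.40      # uncached input
PRICE_OUT = 4.40     # output
PRICE_CACHED = 0.26  # assumed: input tokens read from the prompt cache


def request_cost(uncached_in: int, cached_in: int, out: int) -> float:
    """Cost in USD of one request, given its token counts."""
    total = (uncached_in * PRICE_IN
             + cached_in * PRICE_CACHED
             + out * PRICE_OUT)
    return total / 1_000_000


# Illustrative request at ~100k context: 90k cached, 10k new input, 2k output.
example = request_cost(uncached_in=10_000, cached_in=90_000, out=2_000)
```

That example works out to roughly $0.046 per request, so the bill is driven mostly by how many requests a session makes at high context and how well the cache hits.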

I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things from people who used z.ai directly.

by Lalabadie 1 day ago

[-]

I use GLM 5 Turbo sporadically for a client, and my Openrouter expense might climb over a dollar per day if I insist. At about 20 work days per month it's an easy choice.

by csomar 1 day ago

[-]

I have their most expensive plan and it's on par with, and sometimes better than, Claude, although you have to keep context short. That being said, the quota is no longer generous. It's still priced below Claude but not by that much (compared to a few months ago, when your money got you 10x the tokens).

by esafak 1 day ago

[-]

I'm on their Lite plan and I see some of this too. It is also slow. I use it as a backup.

by benterix 1 day ago

[-]

> Obvious quantization issues

Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?

by cmrdporcupine 1 day ago

[-]

I think what Anthropic is doing is more subtle. It's less about quantizing and more about depth of thinking. They control it on their end and they're dynamically fiddling with those knobs.

by margorczynski 1 day ago