Personally, I consider under 60k tokens to be the smart zone for Opus. It's worse for Opus 4.7 and 4.8 because of the more granular tokenizer.
60k isn't much bigger than the system prompt.
Plus, I've found that the only time models go above 100k tokens anyway is when they've started looping, at which point it's much better to go back anyway.
Anecdotally, most models know their recall is terrible (or have been trained to act as such); that's why they constantly reread files before editing or while reasoning.
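For what it's worth, the "started looping" case can be caught mechanically instead of waiting for the token count to balloon. A rough sketch, assuming your harness lets you see each tool call as a (name, arguments) pair; `LoopDetector`, the window size, and the repeat limit are all made up for illustration and don't come from any real agent:

```python
from collections import Counter, deque
from typing import Deque, Tuple

# Hypothetical shape for a tool call: (tool name, serialized arguments).
ToolCall = Tuple[str, str]


class LoopDetector:
    """Flags a session when the same tool call keeps recurring recently."""

    def __init__(self, window: int = 20, max_repeats: int = 3) -> None:
        # Only the last `window` calls are considered.
        self.recent: Deque[ToolCall] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, call: ToolCall) -> bool:
        """Record a call; return True once it has appeared max_repeats times
        within the current window."""
        self.recent.append(call)
        return Counter(self.recent)[call] >= self.max_repeats


detector = LoopDetector(window=5, max_repeats=3)
calls = [
    ("read_file", "a.py"),
    ("edit", "a.py"),
    ("read_file", "a.py"),
    ("read_file", "a.py"),
]
flags = [detector.record(c) for c in calls]
print(flags)  # [False, False, False, True]
```

Since models reread files on purpose, the repeat limit is a judgment call: set it too low and normal rereads will trip it.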
It seems that people have different workflows, repos, memories, prompts, or expectations.
I read it as a model's performance being random, with the observed differences in opinion being the result of overinterpreting random outcomes.
However, some people seem to always be lucky, which indicates that it is not random but rather down to some fixed differences between people and their environments.
I think that's the issue, rather than 60K being small.
Most of the actual edits/changes I request from Codex are solved within 100-150K tokens. Beyond 200K I'd definitely try to restart the session as soon as I could, as all models are horrible once you get past ~20% of the total context size. And this is while working on million+ LOC codebases.
The problem, I guess, is that there is no solid, concrete evidence of this degradation (obvious to me, and seemingly to others). It should be easy to prove, yet no one has time to sit down and show it :)
But the likelihood of a model getting minor details wrong seems to skyrocket once you're above some magical threshold of 15-20% of the context, and I've hit that issue enough times that my workflow is now built around preventing it.
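To make the arithmetic concrete, here's a minimal sketch of that kind of guard. The 1,000,000-token window is an assumption (it's what makes 200K come out at 20%), and `ContextBudget` is a hypothetical helper, not part of any agent or SDK:

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Tracks how much of a model's context window a session has used."""

    context_window: int             # total window in tokens (assumed value)
    restart_fraction: float = 0.20  # the ~20% rule of thumb, not a measured limit

    def fraction_used(self, tokens_used: int) -> float:
        return tokens_used / self.context_window

    def should_restart(self, tokens_used: int) -> bool:
        # Flag the session once usage reaches the chosen fraction of the window.
        return self.fraction_used(tokens_used) >= self.restart_fraction


budget = ContextBudget(context_window=1_000_000)
for used in (60_000, 150_000, 200_000):
    print(used, f"{budget.fraction_used(used):.0%}", budget.should_restart(used))
```

With those assumed numbers, 150K sits at 15% and 200K is where the restart flag trips. Swap in your own model's window size and whatever threshold you trust.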
I routinely get claude to do things pretty decently and finish up easily in the 4-5 digit range of tokens. It seems to be doing the right kind of thing to not waste its time looking at 1000 files.
"YOU'RE HOLDING IT WRONG!"
I usually see this when the context gets "tainted" as I call it. The model gets stuck on a bad path and there's no way to bring it back without clearing the context and starting again.
Frequently it'll be something as small as one sentence of a prompt many messages ago.
When cases like that happen, I reset the context and try to be explicit about assumptions and requirements to keep it off the "tainted" path. Other times tainting is actually useful, and agents will do things they normally wouldn't do once the state is tainted. For instance, if you're testing a chatbot's ability to stay on topic, you can seed the context early with what you want it to do. It will generally refuse at first, but later in the conversation it will still silently take that seeded context into account, almost "subconsciously", and become more likely to do the thing it originally refused.
Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?
Different models, and versions of models, use different types of attention, which affects their long-context performance, and no doubt they also get different amounts/types of long-context training.
Different agents build context differently and implement context compaction differently.
Unless someone else is using the same model as you, the same agent/harness as you, and doing very similar tasks, then there is no reason to suppose that their experience of model behavior relating to context size is going to be the same as yours.
Relax, I acknowledged this in my comment...
Which drugs?
opus 4.5 would start failing tool calls when approaching its 200k limit, opus 4.6 could get to ~300k before getting confused, opus 4.7 i could stretch to around 400k before the dumb zone started, with opus 4.8 i've had sessions get over 500k comfortably.
admittedly we only had limited time with fable, but i had a couple sessions get into 800-900k just fine.
100k tokens "by lunch" is also not my finding; the newer models will already hit that in the initial exploratory phase
Can you imagine even a junior making such a mistake?