I see this said often and find it insane given how many times I find opus models making basic recall mistakes at <100k tokens.

Personally I consider < 60k to be the smart zone for opus. This is worse for opus 4.7 and 4.8 cause of the more granular tokenizer

by eterm11 hours ago|

60k is tiny, if it's making recall mistakes that early then you might have some false memories or incorrect instructions in your CLAUDE.md.

60k isn't much bigger than the system prompt.

by Bolwin4 hours ago|

I don't use Claude Code. I use my own handwritten agent (formerly using Pi) and know every token that goes into it. There are zero memories to confuse it. The system prompt is 200 tokens and completely self consistent.

Plus I've found that the only time models go above 100k tokens anyway is when they've started looping at which point it's much better to go back anyway.

Anecdotally most models know their recall is terrible (or have been trained to act as such), that's why they constantly reread files before editing or while reasoning.

by danielbln11 hours ago|

Yeah 60k is ludicrous, I've barely seeded the context at that point and I don't see context related degradation until well into the 600-700k.

by qsera10 hours ago|

In this thread: People tossing coins independently and fighting over the result they got.

by kuboble8 hours ago|

No it's not.

It seems that people have different workflows or repos, or memories or prompts or expectations.

by diab0lic7 hours ago|

For what it’s worth, as a third party I read your and qsera’s comments as saying the same thing.

by kuboble5 hours ago|

Maybe I misread the comment then.

I read it as saying that a model's performance is random, and that the observed differences in opinion result from overinterpreting random outcomes.

I think however that some people seem to be always lucky which indicates that it is not random but rather some fixed differences between people and their environments.

by embedding-shape10 hours ago|

> I've barely seeded the context at that point

I think that's the issue, rather than 60K being small.

Most of the actual edits/changes I request from codex are solved within 100-150K tokens. Beyond 200K I'd definitely try to restart the session as soon as I could, as all models are horrible once you get past ~20% of the total context size. And this is while working on million+ LOC codebases.

Problem I guess is that there is no solid and concrete evidence of this (to me [and others seemingly] obvious) degradation. It should be easy to prove, yet no one has time to sit down and show it :)

But the likelihood of a model getting minor details wrong seems to skyrocket once you're above some magical threshold between 15-20%, and I've hit that issue enough times that my workflow is now built around preventing it.
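
The rule of thumb in this comment (restart the session once usage crosses roughly 15-20% of the total context size) can be sketched as a simple budget check. This is only an illustration, not something any particular harness provides: the function names, parameters, and the 0.20 default are assumptions, and the threshold itself is the commenter's anecdotal figure rather than a measured one.

```python
def context_fraction(tokens_used: int, context_window: int) -> float:
    """Return the share of the context window already consumed."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if tokens_used < 0:
        raise ValueError("tokens_used must be non-negative")
    return tokens_used / context_window


def should_restart(tokens_used: int, context_window: int,
                   threshold: float = 0.20) -> bool:
    """True once usage reaches the assumed degradation threshold.

    The 0.20 default mirrors the "15-20%" range quoted above; tune it
    to whatever your own sessions suggest.
    """
    return context_fraction(tokens_used, context_window) >= threshold
```

With a 1M-token window, `should_restart(150_000, 1_000_000)` is false and `should_restart(200_000, 1_000_000)` is true, which matches the "beyond 200K I'd restart" habit described above.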

by rtpg7 hours ago|

what are y'all doing to hit that? Do you just not give it any pointers and let it churn away? What kind of context are you handing off?

I routinely get claude to do things pretty decently and finish up easily in the 4-5 digit range of tokens. It seems to be doing the right kind of thing to not waste its time looking at 1000 files.

by da_grift_shift11 hours ago|

>you might have some false memories or incorrect instructions in your CLAUDE.md

    "YOU'RE HOLDING IT WRONG!"

by RugnirViking11 hours ago|

did you internalize what was wrong with that quote when it was said? does it apply here?

by perching_aix10 hours ago|

[dead]

by nijave7 hours ago|

>making basic recall mistakes at <100k tokens.

I usually see this when the context gets "tainted" as I call it. The model gets stuck on a bad path and there's no way to bring it back without clearing the context and starting again.

Frequently it'll be something as small as 1 sentence of a prompt many messages ago.

When cases like that happen, I reset the context and try to be explicit about assumptions and requirements to keep it off the "tainted" path. Other times it's actually useful and agents will do things they normally wouldn't do once the state is tainted. For instance, if you're testing a chat bot's ability to stay on topic, you can seed the context early with what you want it to do. It generally will refuse initially but later on in the conversation it will still silently take that seeded context into account almost "subconsciously" and become more likely to do the thing it originally refused.

by CjHuber11 hours ago|

I'm always a bit confused when people say things like this. 60k tokens is often more than the initial context I feed the model with. And I don't think I ever had a productive session that began under 150k tokens.

by embedding-shape10 hours ago|

Bit of what makes it so fun, our experiences seem to wildly differ! On one hand, you have experiences like yours, but then my own experience is that I never had a productive session when the scope grows beyond 150K tokens! If I needed 60K just as a starting context, I'd take that to mean the suggested change is way too large. And if the model cannot solve the entire thing within maybe 15-20% of the total context size, divide and conquer is needed; otherwise a lot of time will be wasted patching things up once things are "completed".

by CjHuber9 hours ago|

Yeah indeed it's very interesting. And the 60k initial context doesn't even contain the suggested change yet. For me, if I don't do this, the current models tend to fixate on local patches instead of tracing symbols and building a holistic model of what a change interacts with in the codebase.

by wg011 hours ago|

Not specific to Opus but yes it would make mistakes. I usually try to keep context window under 10%

by properbrew11 hours ago|

I hate to do the "you're holding it wrong" trope, but I think you might have something misconfigured somewhere unless you missed a 0, because just past 60k tokens is such a small context window to be seeing issues in.

Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?

by HarHarVeryFunny49 minutes ago|

Not everybody is using the same model and harness as you, nor using the model the same way as you.

Different models, and versions of models, use different types of attention, which affects their long-context performance, and no doubt also do different amounts/types of long context training.

Different agents build context differently and implement context compaction differently.

Unless someone else is using the same model as you, the same agent/harness as you, and doing very similar tasks, then there is no reason to suppose that their experience of model behavior relating to context size is going to be the same as yours.

by kelnos46 minutes ago|

> then there is no reason to suppose that their experience of model behavior relating to context size is going to be the same as yours.

Relax, I acknowledged this in my comment...

by arcanemachiner11 hours ago|

Opus 4.6 was on drugs past 200k, I skipped 4.7, 4.8 did good up to ~350k, and Fable did great beyond 400k, in my limited testing. The quality does appear to be trending upwards.

by throwaway3141559 hours ago|

> Opus 4.6 was on drugs past 200k

Which drugs?

by justinclift8 hours ago|

The way it hallucinates stuff, it'd probably be something in the LSD family. ;)

by aeonik7 hours ago|

Combine it with meth and sleep deprivation and that could explain it.

by nijave7 hours ago|

Shrooms, sometimes crack

by pdantix4 hours ago|

agreed. the claudes have been getting better and better with every release in this regard.

opus 4.5 would start failing tool calls when approaching its 200k limit, opus 4.6 could get to ~300k before getting confused, opus 4.7 i could stretch to around 400k before the dumb zone started, with opus 4.8 i've had sessions get over 500k comfortably.

admittedly we only had limited time with fable, but i had a couple sessions get into 800-900k just fine.

by tyleo8 hours ago|

I often push past 300k or so and I’ve absolutely worked at 800k but it’s an observable problem. Large context windows can work depending on the problem but I do feel more effective biasing towards small ones <300k.

by fullstackchris11 hours ago|

That's another problem with this post: the author mentions Claude but not explicitly which models...

100k tokens "by lunch" is also not my finding, the newer models will hit that already right in the initial exploratory phase

by arcanemachiner11 hours ago|

Really depends on the project.

by stavros11 hours ago|

I found "by lunch" odd too, but considering that Claude wrote the article, it's not going to know specifics.

by asd8812 hours ago|

I’ve had similar experiences with Fable. 70%+ context used out of 1M, still sharp and no memory issues.

by csomar11 hours ago|

I have a custom build command for a rust project (yarn build:lib) and my experience is 120k for GLM and roughly 200-300k for Opus. After that, they default to cargo build.

by trapexit11 hours ago|

My projects have specific build/verify steps as well, and after a certain point Claude forgets to run them. I’m going to try a “No brown M&Ms” hook to halt Claude if it tries to run the default command instead of the instructed commands from CLAUDE.md. Perhaps this will be a good signal that a compacted or fresh session is needed at that point to avoid mistakes.
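
A guard of the kind described here could look roughly like the sketch below. It is harness-agnostic on purpose: how the check gets wired into Claude Code (or any other agent) is left out, and `REQUIRED_COMMANDS` and `check_command` are hypothetical names. The `cargo build` / `yarn build:lib` pair is borrowed from the parent comment as an example.

```python
import shlex
from typing import Optional

# Hypothetical table: forbidden default command prefix -> project command to use.
# The single entry is taken from the parent comment; add your own.
REQUIRED_COMMANDS = {
    ("cargo", "build"): "yarn build:lib",
}


def check_command(command: str) -> Optional[str]:
    """Return a message if `command` starts with a forbidden default, else None."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes etc.: this guard does not try to judge such input.
        return None
    for prefix, replacement in REQUIRED_COMMANDS.items():
        if tuple(tokens[: len(prefix)]) == prefix:
            default = " ".join(prefix)
            return f"Blocked '{default}': use '{replacement}' instead."
    return None
```

A non-`None` result is the "brown M&M": the caller can halt the agent and treat it as a signal that a compacted or fresh session is due. Note that this only inspects the start of the command, so something like `cd lib && cargo build` would slip through; a real hook would need to handle compound commands.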

by csomar6 hours ago|

I mean, that’s basically the magic of the harness. The whole thing that skyrocketed the intelligence is that the harness (CLI tool) prevents the LLM from editing a file before reading it.

Can you imagine even a junior making such a mistake?
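
The read-before-edit rule mentioned above is simple to express. The following is a minimal sketch of the idea, not how any specific CLI tool implements it; the class and method names are assumptions for illustration.

```python
class ReadBeforeEditGuard:
    """Tracks which files were read this session and refuses edits to the rest."""

    def __init__(self) -> None:
        self._read_paths = set()

    def record_read(self, path: str) -> None:
        """Call whenever the model's read tool returns a file's contents."""
        self._read_paths.add(path)

    def check_edit(self, path: str) -> None:
        """Raise PermissionError if the model tries to edit an unread file."""
        if path not in self._read_paths:
            raise PermissionError(f"Refusing to edit {path!r}: read it first.")
```

A harness would call `record_read` from its read tool and `check_edit` at the top of its edit tool, returning the error text to the model so it can read the file and retry.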

by cyanydeez11 hours ago|

As the gamblers say at the poker table: If you can't figure out who the mark is when you sit down...