It seems that people have different workflows or repos, or memories or prompts or expectations.
I read it as a models performance being random and observed differences in the opinions are the results of the overinterpretation of the random outcomes.
I think however that some people seem to be always lucky which indicates that it is not random but rather some fixed differences between people and their environments.
I think that's issue, rather than 60K being small.
Most of the actual edits/changes I request to codex are solved within 100-150K tokens, beyond 200K I'd definitively try to restart the session as soon as I could as all models are horrible once you get across ~20% of the total context size. And this is while working on +million LOC codebases.
Problem I guess is that there is no solid and concrete evidence of this (to me [and others seemingly] obvious) degradation, but should be easy to prove, yet no one has time to sit down and show it :)
But the likelihood of a model getting minor details wrong once you're above some magical threshold between 15-20%, seems to skyrocket, and I hit that issue sufficient amount of times that now my workflow is trying to prevent that.
I routinely get claude to do things pretty decently and finish up easily in the 4-5 digit range of tokens. It seems to be doing the right kind of thing to not waste its time looking at 1000 files.