Here is the question for which I cannot find an answer, and cannot yet afford to answer myself:
In Claude Code, I use Opus 4.6 1M, but stay under 250k tokens via careful session management to avoid the known NoLiMa [0] / context rot [1] degradation. The question I keep wanting answered, though: at ~165k tokens used, does Opus 1M actually deliver higher quality than Opus 200k?
If degradation tracks the fraction of the context window used, NoLiMa would suggest that on a ~165k-token request Opus 200k would do poorly and Opus 1M would do better (since a lower percentage of its window is used)... but they are supposedly the same model. However, there could be practical differences in how inference is deployed for each that change the picture, right? I am so confused.
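To put numbers on the "percentage of the window" idea, here is a quick sketch of the arithmetic. It assumes the two windows are exactly 200k and 1M tokens, and the variant labels are just my own names, not real model IDs.

```python
# Fraction of each context window consumed by the same ~165k-token request.
REQUEST_TOKENS = 165_000

# Hypothetical labels for the two variants (not real model identifiers).
WINDOWS = {"opus-200k": 200_000, "opus-1m": 1_000_000}


def fill_fraction(request_tokens: int, window_tokens: int) -> float:
    """Return the share of the window that the request occupies."""
    return request_tokens / window_tokens


fractions = {name: fill_fraction(REQUEST_TOKENS, w) for name, w in WINDOWS.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of window used")
```

So the same request sits at roughly 82.5% of one window and 16.5% of the other. Whether quality tracks that fraction or the absolute token count is exactly the part I cannot answer.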
Anthropic says it's the same model [2]. But Claude Code's own source treats them as distinct variants with separate routing [3]. The closest test I found [4] asserts they're identical below 200K, but it never actually A/B tests them, correct?
Inside Claude Code it's probably not testable, right? According to this issue [5], the CLI is non-deterministic for identical inputs, and agent sessions branch on tool use. It would need a clean API-level test.
The API-level result is what I really want to know for the Claude-based features in my own apps. Is there a real benchmark for this?
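For what it's worth, this is roughly the shape of the paired test I have in mind. It is only a minimal sketch under my own assumptions: `ask_a` and `ask_b` are hypothetical callables that you would write yourself to send a prompt to each variant and return the reply text (no real API calls appear here), the filler and "vault code" needle are made up, and a literal-match needle like this is much easier than the non-literal matching NoLiMa [0] measures. Both variants see the identical prompt on every trial, and the discordant pairs go through an exact two-sided sign test.

```python
import random
from math import comb
from typing import Callable, Dict, Tuple


def make_case(rng: random.Random, n_filler: int) -> Tuple[str, str]:
    """Build one needle-in-haystack prompt and return (prompt, expected answer)."""
    secret = str(rng.randrange(10**6, 10**7))  # 7-digit needle
    lines = [f"Note {i}: nothing of interest happened." for i in range(n_filler)]
    lines.insert(rng.randrange(n_filler + 1), f"The vault code is {secret}.")
    prompt = "\n".join(lines) + "\n\nWhat is the vault code? Reply with digits only."
    return prompt, secret


def sign_test_p(a_only: int, b_only: int) -> float:
    """Exact two-sided sign test on discordant pairs (ties are ignored)."""
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


def run_paired(
    ask_a: Callable[[str], str],
    ask_b: Callable[[str], str],
    trials: int,
    n_filler: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Send the same prompts to both variants and tally paired outcomes."""
    rng = random.Random(seed)
    counts = {"both": 0, "a_only": 0, "b_only": 0, "neither": 0}
    for _ in range(trials):
        prompt, secret = make_case(rng, n_filler)
        ok_a = secret in ask_a(prompt)
        ok_b = secret in ask_b(prompt)
        if ok_a and ok_b:
            counts["both"] += 1
        elif ok_a:
            counts["a_only"] += 1
        elif ok_b:
            counts["b_only"] += 1
        else:
            counts["neither"] += 1
    result: Dict[str, float] = dict(counts)
    result["acc_a"] = (counts["both"] + counts["a_only"]) / trials
    result["acc_b"] = (counts["both"] + counts["b_only"]) / trials
    result["p_value"] = sign_test_p(counts["a_only"], counts["b_only"])
    return result
```

To use it, you would size `n_filler` so the prompt lands near ~165k tokens, plug in the two callables with identical sampling settings, and run enough trials for the sign test to mean something. Since outputs are not deterministic, repeating each prompt several times per variant is probably needed too.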
I have reached the limits of my understanding on this problem. If what I am trying to say makes any sense, any help would be greatly appreciated.
If anyone could help me ask the question better, that would also be appreciated.
[0] https://arxiv.org/abs/2502.05167
[1] https://research.trychroma.com/context-rot
[2] https://claude.com/blog/1m-context-ga
[3] https://github.com/anthropics/claude-code/issues/35545
[4] https://www.claudecodecamp.com/p/claude-code-1m-context-wind...