upvote
It's pretty hard to measure because most context rot comes from related context and the model has to be able to figure which parts are truly relevant, which ones are relevant but stale, which ones to ignore etc.

Each relevant thing is basically a rule. Trying to so something with 500 rules is what's hard.

If you take a standard benchmark and just prepend a random book to it, it will not capture that

reply
Would be still interesting whether it degraded the performance in that case. Further, many non-agentic benchmarks consist of many short tasks, so one could fill the context with task/response pairs from other tasks (like in a standard chat environment) and then ask the current task at the end. Given that the tasks are probably somewhat similar, context rot should occur.
reply