It's pretty hard to measure because most context rot comes from related context: the model has to figure out
which parts are truly relevant, which ones are relevant but stale, which ones to ignore, etc.
Each relevant thing is basically a rule. Trying to do something while following 500 rules is the hard part.
If you take a standard benchmark and just prepend a random book to it, you won't capture that, since unrelated text is easy for the model to ignore wholesale.
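
Here is a rough sketch of the kind of test context I mean, as opposed to random padding. Everything here is made up for illustration: `build_prompt`, the three labels, and the toy rules are assumptions, not an existing benchmark. The idea is just that every item is on-topic, so the model cannot skip anything and has to sort current rules from stale ones and near-miss distractors.

```python
import random


def build_prompt(question, current_rules, stale_rules, distractors, seed=0):
    """Mix on-topic rules of different status into one context (hypothetical helper).

    Returns the prompt text and the per-item labels in prompt order, so a grader
    can later check which kind of item an answer relied on.
    """
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    items = (
        [("current", r) for r in current_rules]
        + [("stale", r) for r in stale_rules]
        + [("distractor", r) for r in distractors]
    )
    rng.shuffle(items)
    context = "\n".join(f"- {text}" for _, text in items)
    labels = [label for label, _ in items]
    return f"{context}\n\nQuestion: {question}", labels


# Toy data, invented for this example.
prompt, labels = build_prompt(
    question="What is the retry limit for the billing service?",
    current_rules=["Billing service: retry at most 3 times."],
    stale_rules=["Billing service: retry at most 5 times (superseded)."],
    distractors=["Search service: retry at most 10 times."],
)
print(prompt)
```

Scaling the three lists up to hundreds of entries gets closer to the "500 rules" situation than prepending a book does, though how well that tracks real-world context rot is still an open question.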