They introduced a feature/optimization that triggered after an hour's idleness, so testing that the session continued properly afterwards seems kind of important. If nothing else, even the working-as-intended feature (context cleanup) could impact model skill in a current or future model version, so it would be well worth measuring any impact as part of the test suite.