> Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived Mythos Preview as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities. (p201)
^^ From the surrounding context, this may simply be because the model tends to do a lot of work in the background, which naturally takes time.
> Terminal-Bench 2.0 timeouts get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed. Moreover, some Terminal-Bench 2.0 tasks have ambiguities and limited resource specs that don’t properly allow agents to explore the full solution space — both being currently addressed by the maintainers in the 2.1 update. To exclusively measure agentic coding capabilities net of the confounders, we also ran Terminal-Bench with the latest 2.1 fixes available on GitHub, while increasing the timeout limits to 4 hours (roughly four times the 2.0 baseline). This brought the mean reward to 92.1%. (p188)
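
^^ A toy sketch of the timeout confounder described in the quote above. Everything here is hypothetical: the task sizes, the two sampling speeds, and the assumption that a task is solved exactly when its tokens can be generated before the timeout. The one-hour figure is only the rough baseline implied by "four times" in the quote. It is not how Terminal-Bench scores tasks; it only illustrates how two agents with identical ability can get different scores under a tight limit.

```python
def passes(tokens_needed: int, tokens_per_second: float, timeout_s: float) -> bool:
    """Toy rule: a task is solved only if generation finishes before the timeout."""
    return tokens_needed / tokens_per_second <= timeout_s


def pass_rate(task_tokens: list[int], tokens_per_second: float, timeout_s: float) -> float:
    """Fraction of tasks that finish within the timeout at a given sampling speed."""
    solved = sum(passes(t, tokens_per_second, timeout_s) for t in task_tokens)
    return solved / len(task_tokens)


# Hypothetical workload: tokens each task needs (same for both agents,
# i.e. identical "capability" in this toy model).
task_tokens = [50_000, 100_000, 200_000, 300_000, 350_000]

# Hypothetical sampling speeds in tokens per second.
fast_speed = 100.0
slow_speed = 25.0

SHORT_TIMEOUT_S = 1 * 3600  # assumed rough baseline
LONG_TIMEOUT_S = 4 * 3600   # the 4-hour limit mentioned in the quote

for label, speed in [("fast", fast_speed), ("slow", slow_speed)]:
    short = pass_rate(task_tokens, speed, SHORT_TIMEOUT_S)
    long = pass_rate(task_tokens, speed, LONG_TIMEOUT_S)
    print(f"{label}: short timeout {short:.0%}, long timeout {long:.0%}")
```

In this toy setup the slow agent solves 20% of tasks under the short limit and 100% under the long one, while the fast agent solves 100% under both, so the gap at the short limit reflects sampling speed alone.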
> ...Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%). However, the model achieves this score with a considerably smaller token footprint: the best Mythos Preview result uses 4.9× fewer tokens per task than Opus 4.6 (226k vs. 1.11M tokens per task). (p191)
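
^^ A quick check of the arithmetic in the quote above, using only the four figures it gives (variable names are my own):

```python
# Figures as quoted (p191).
opus_tokens_per_task = 1.11e6    # Claude Opus 4.6, tokens per task
mythos_tokens_per_task = 226e3   # Mythos Preview, tokens per task
opus_accuracy = 83.7             # percent
mythos_accuracy = 86.9           # percent

# How many times fewer tokens Mythos Preview uses per task.
token_ratio = opus_tokens_per_task / mythos_tokens_per_task

# Accuracy difference in percentage points.
accuracy_gain_points = mythos_accuracy - opus_accuracy

print(f"token ratio: {token_ratio:.2f}x")                    # token ratio: 4.91x
print(f"accuracy gain: {accuracy_gain_points:.1f} points")   # accuracy gain: 3.2 points
```

The ratio works out to about 4.91, consistent with the quoted 4.9×, alongside a gain of 3.2 percentage points.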