It's only ~100k tokens. Anyone who routinely works with Codex (or any agentic harness, really) can tell you how trivial it is to eat up 100k tokens doing complex work. I've personally had plenty of Codex 5.5 xhigh sessions where the chain-of-thought token count alone exceeds 200k in a single turn (and I assume it only stops there because of compaction meta-guidance; the harness will push the model to stay under 256k per turn/thinking block).
I think the more interesting question is how many tokens were spent all told. The most interesting graph in the article, imo, is the success rate by log test-time compute: how many tokens are being spent on the right side of the graph to hit a winning CoT/solution like this >50% of the time?
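To make that question concrete, here's a rough back-of-the-envelope sketch. It assumes attempts are independent samples with a fixed per-attempt success rate and a flat ~100k tokens each, which is almost certainly not how the article's setup actually works. The function names and the example rates are made up for illustration, not taken from the article.

```python
import math


def attempts_for_target(p_single: float, target: float = 0.5) -> int:
    """Smallest n with 1 - (1 - p_single)**n >= target.

    Assumes every attempt is independent with the same success
    probability p_single (a simplification, not the article's method).
    """
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    if p_single >= target:
        # A single attempt already meets the target.
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


def total_tokens(p_single: float,
                 tokens_per_attempt: int = 100_000,
                 target: float = 0.5) -> int:
    """Total tokens across all attempts needed to reach the target rate."""
    return attempts_for_target(p_single, target) * tokens_per_attempt


if __name__ == "__main__":
    # Hypothetical per-attempt success rates, purely illustrative.
    for p in (0.01, 0.1, 0.5):
        print(p, attempts_for_target(p), total_tokens(p))
```

Under those assumptions, a hypothetical 1% per-attempt success rate means 69 attempts, or roughly 6.9M tokens, to clear 50%. So the single ~100k winning trace could easily be a small fraction of the real spend.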