Because they're non-deterministic, because of constant updates and changes, and because the models are throttled according to number of users, releases, etc.
But I use Codex and Claude daily (work and hobby, respectively). And there are days where one or the other just seems to have gotten up on the wrong side of the bed. Or is just being lazy. Or is suddenly super-powered and does everything, including what I asked it not to. (To be fair, the same thing happens with me. :/)
I am convinced that if I were benchmarking, I would conclude these are different models on different days.
[This conviction may say more about me than about the model.]