I'm not being a hater, I love Opus for different reasons, but I can't rely on it for its technical ability.
Because GPT 5.5 just launched and those games take longer to accumulate data for, it just doesn't have enough samples yet. It will end up with a wider lead on Opus, I am sure. Coding evals always have large sample sizes on day 1. Good find, we should probably better adjust the weighting here for decision games with low match counts.