Playing with this some more and it's actively not good. Responses are riddled with basic mathematical errors. I did some basic adversarial testing where its responses are analyzed by Gemini, and Gemini is finding basic math errors in every relatively simple ask I make (simple relative to what Opus, Gemini, or GPT can handle). Yikes.
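For anyone curious, the cross-model check being described looks roughly like this. This is a minimal sketch using the google-generativeai client; the model name, prompt wording, and the check_response helper are my assumptions, not the commenter's actual harness:

```python
# Sketch of a cross-model check: send one model's answer to Gemini
# and ask it to flag arithmetic mistakes. Model name and prompt are
# illustrative assumptions, not a real test setup.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes a valid Gemini API key
reviewer = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name


def check_response(question: str, answer: str) -> str:
    """Ask Gemini to review another model's answer for math errors."""
    prompt = (
        "You are reviewing another model's answer for mathematical errors.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "List any arithmetic or algebraic mistakes, or reply 'No errors found.'"
    )
    return reviewer.generate_content(prompt).text


# Example: the reviewer should flag that 17 * 24 is 408, not 398.
print(check_response("What is 17 * 24?", "17 * 24 = 398"))
```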
I have the opposite take: random HN/Reddit comments saying “this sucks” or “whoa, this is a huge improvement” are the only benchmark that means anything. Standard benchmarks are all gamed and don’t capture the complexity of the real world.