For JSON-to-text formatting it works well on a one-round basis. So I think you should realistically have an evaluation ready to go so you can run it on these models. I currently judge them myself, but people often use a smart LLM as the judge.
Today, writing an eval harness with Claude is a 5-minute job. Do it yourself so you can keep exploring as the quants for Gemma get better.
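The harness really can be that small. Here's a minimal sketch of the LLM-as-judge pattern: the judge is injected as a function so you can swap in a real API call later. `Case`, `run_eval`, and `fake_judge` are all hypothetical names for illustration, and `fake_judge` is a stand-in for an actual model call.

```python
# Minimal LLM-as-judge eval harness sketch (names are illustrative).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str   # the JSON (or other input) given to the model
    output: str   # the model's text output under evaluation

def run_eval(cases: list[Case], judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of cases the judge marks as acceptable."""
    if not cases:
        return 0.0
    passed = sum(judge(c.prompt, c.output) for c in cases)
    return passed / len(cases)

# Stand-in judge: in practice you'd prompt a strong LLM with a rubric
# ("does the text faithfully render every field of the JSON? answer yes/no")
# and parse the verdict out of its reply.
def fake_judge(prompt: str, output: str) -> bool:
    return "error" not in output.lower()

cases = [
    Case('{"name": "Ada"}', "Name: Ada"),
    Case('{"name": "Bob"}', "ERROR: could not parse"),
]
print(run_eval(cases, fake_judge))  # → 0.5
```

The point of injecting the judge is that the same loop works whether you're eyeballing outputs yourself, calling a hosted model, or replaying cached verdicts while you iterate on the prompt.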
On top of that, their $20 plan has much higher usage limits than Anthropic's $20 plan, and they allow its use in e.g. opencode. So you can set up opencode to use both OpenAI's Codex plan and one of the more intelligent Chinese models to maximize your usage: have it fully plan things out using GPT 5.4, write the code using e.g. Qwen 3.6, then switch back to GPT 5.4 for review.
That is, I get more variance between Opus 4.6 and itself than I do between the SOTA models.
I don’t have the budget for statistical significance, but I’m convinced that people claiming broad differences are just vibing, or that there are cases where agent features make a big difference.