The downside, of course, is that they consume many more tokens from your plan and are significantly slower. Kimi K2.7 takes about 7x longer than DeepSeek V4 Pro to finish the same benchmark tasks on my router benchmarks (https://role-model.dev/).
So for now I'm happy with just two models: GPT and DeepSeek.
1. DeepSeek V3.2, V4 Flash, V4 Pro, at high or max thinking, ... When recommending a model, always name a precise model, not just an AI lab.
2. DeepSeek V4 Flash at max thinking is the most verbose model (among top models) in the AA benchmarks. See the "Intelligence Index Token Use" chart: [1]
[1]: https://artificialanalysis.ai/models?models=gpt-5-5-high%2Cg...
I haven't tried DeepSeek yet; I should check this one out.
If it needs to generate that many tokens to do the same tasks, then it probably has higher inference costs. So (for you) the model is bad, but the plan is the same plan.
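A rough sketch of the point about verbosity, with made-up numbers: at the same per-token price, cost per task scales with the number of tokens generated. The token counts and the price below are hypothetical placeholders, not measurements of any model.

```python
def cost_per_task(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of one task when billing is purely per generated token."""
    return output_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical figures, for illustration only.
PRICE = 2.0              # USD per million output tokens (assumed)
TERSE_TOKENS = 10_000    # tokens a terse model spends on a task (assumed)
VERBOSE_TOKENS = 40_000  # tokens a verbose model spends on the same task (assumed)

terse_cost = cost_per_task(TERSE_TOKENS, PRICE)
verbose_cost = cost_per_task(VERBOSE_TOKENS, PRICE)

# Same price per token, so the cost ratio equals the token ratio.
cost_ratio = verbose_cost / terse_cost
print(f"terse: ${terse_cost:.3f}  verbose: ${verbose_cost:.3f}  ratio: {cost_ratio:.1f}x")
```

On a flat-rate plan you don't see that as dollars; you see it as quota burning down faster for the same work.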
"Make a pac-man game in a single html page"
It went off and argued with itself for 20 minutes about how to lay out the map and then timed out.