That's not what anyone means when they say frontier models; don't change the definition. It's almost as bad as open weight being subsumed by open source when it comes to local models.
I've tried both Opus and GPT 5.4. They hallucinate just like the rest, at a much higher cost.
The more you use a model over time, the better you become with it. That's really hard to measure; my main metric lately has been tokens per second/time to complete a task.
At this point I have the feeling frontier models are being optimized for benchmarks and one-shot prompts.