If you can make the existing model faster, you can spend the saved inference budget on a bigger model, which makes it smarter.
A lot of how smart the models can be comes down to budget. If you can make your existing thing cheaper, you can instead make it bigger for the same price.
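To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes inference cost scales linearly with parameter count, which is a simplification, and every number and name in it is hypothetical, picked only to make the arithmetic visible.

```python
def affordable_size(budget: float, cost_per_billion_params: float) -> float:
    """Largest model (in billions of parameters) that fits the budget,
    assuming cost grows linearly with parameter count (an assumption)."""
    return budget / cost_per_billion_params


# Hypothetical figures in arbitrary cost units per million tokens.
budget = 140.0     # fixed spend per million tokens
base_cost = 2.0    # cost per billion parameters before any optimization
speedup = 2.0      # inference becomes 2x cheaper per parameter

before = affordable_size(budget, base_cost)
after = affordable_size(budget, base_cost / speedup)
ratio = after / before

print(f"before: {before:.0f}B params, after: {after:.0f}B params ({ratio:.1f}x)")
```

Under that linear assumption a 2x speedup buys a model twice the size for the same spend. Whether the extra size translates into a smarter model is a separate question.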
There are diminishing returns, and at some point making a model bigger makes it dumber.
(Not trying to flame bait or anything. I just wouldn't describe LLMs as exhibiting intelligence. They are great at making connections based on probability but don't have a semantic understanding of what they are doing.)