I also think the dynamic would be really different if model inference could run at ridiculous speeds. You could wrap a genetic-algorithm loop around it: generate a population of proposals at each step, test them, and whittle them down iteratively. If inference runs at thousands of tokens per second, the whole loop would still feel really fast from the user's perspective, and even a small model might be able to solve complex problems.
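
To make the loop concrete, here is a rough Python sketch of what I have in mind. Everything in it is a toy assumption on my part: `propose` stands in for a fast model call (here it just mutates one character at random), `score` stands in for whatever test harness ranks the proposals, and `TARGET` is a placeholder for "the problem is solved". The population size and survivor count are arbitrary.

```python
import random
import string

# Toy stand-in for "all tests pass"; purely illustrative.
TARGET = "hello world"
ALPHABET = string.ascii_lowercase + " "


def propose(parent: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a fast model call.

    A real system would ask the model for a variation of `parent`;
    here we just replace one random character.
    """
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(ALPHABET) + parent[i + 1:]


def score(candidate: str) -> int:
    """Hypothetical tester: count positions that match TARGET."""
    return sum(a == b for a, b in zip(candidate, TARGET))


def evolve(pop_size: int = 50, keep: int = 10,
           max_steps: int = 2000, seed: int = 0) -> tuple[str, int]:
    """Generate, test, and whittle down proposals until one scores perfectly.

    Returns the best candidate found and the step at which the loop stopped.
    """
    rng = random.Random(seed)
    # Initial population: random strings of the right length.
    population = [
        "".join(rng.choice(ALPHABET) for _ in TARGET)
        for _ in range(pop_size)
    ]
    for step in range(max_steps):
        # Test: rank every proposal, best first.
        population.sort(key=score, reverse=True)
        if score(population[0]) == len(TARGET):
            return population[0], step
        # Whittle down: keep only the top `keep` proposals.
        survivors = population[:keep]
        # Regenerate: ask the "model" for variations of the survivors.
        children = [
            propose(rng.choice(survivors), rng)
            for _ in range(pop_size - keep)
        ]
        population = survivors + children
    population.sort(key=score, reverse=True)
    return population[0], max_steps


if __name__ == "__main__":
    best, steps = evolve()
    print(f"best={best!r} after {steps} steps")
```

The point is just the shape of the loop: every step costs `pop_size - keep` model calls, so it only feels interactive if each call is very cheap, which is where the thousands-of-tokens-per-second part comes in.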