So I could see something like this where the neural chipset has an LLM that cant be so easily updated baked into it, until you get a new device
I also think the dynamic would be really different if model inference can run at ridiculous speeds. You could make a genetic algorithm loop around it, so it can generate a population of proposals at each step, then have those tested and whittled down iteratively. If inference happens at thousands of tokens per second, then from user perspective it would still be really fast, and even a small model could solve complex problems.