Then, a bit like open router, it does a classifier job with a fast model to choose which one should process the turn.
In my case I usually don’t do local vs remote… although it can. Now I use it for thinking vs no-think against my preferred local model, which is a huge time saver even with the added classification step.
Does this interfere with cache hits? Could a single conversation or task span multiple roles?
Why are you building this? Does this maximize my toxen value by saving the hard tasks for the hard model? Does it maximize cache hits as part of its scoring? Does it help agents develop a specialist mindset? Are you anticipating users will have many local models hot, or is this also a model load/unload controller?
For this we don't just need a router, because the information to make detailed and accurate routing decisions currently doesn't exist. And there are no standards but every lab and maybe even inference providers have their own way of implementing reasoning, chat templates, cache, tool use and so on. All issues that make models non-interoperable.
What we need is applications that clearly specify their requests so they can be accurately routed to a provider, whether local or remote. And for that they need to use a standard protocol for model requests and intent.
I wrote a longer piece here: https://news.ycombinator.com/item?id=48706181