It looks like much more context is required to decide on the best model (e.g., summarizing logs might use a cheap model, whereas you likely want Opus/Mythos/GPT 5.6 to debug multithreading logic). In an agentic system, a decision about the model may be embedded in the decision to orchestrate the model.
But intuitively I think it makes sense that a model can learn which model to route things to if it has all the relevant info, and in our experience it works pretty well.
Hard to quantify this ofc but that's what I've felt vibes wise from using this for the last month.
How does this router translate to $$$ when developing?
This is the key thing that other routers we've seen miss: they're stateless so for a coding agent use case you end up spending more money due to all the cache misses.
In practice you just pick one and stick with it until the API stops or you hit performance issues.
I'm just trying to figure out why on the fly routing would beat testing and tuning and locking models and versions for each class of call, with evals and auto tunes running to explore more possible models for commonly run classes of prompt over time . . .
As prices increase we will see more of these tools to optimise and make the best use of a token budget.
Happy to talk about this in some more depth if there's anything specific you're curious about!
Will this use my Claude Pro/Max subscription? Or will it always use the API billing "pay as you go"?
We haven't yet set up local model routing though, that's really interesting - have you had any success using local models for coding tasks? Tbh I haven't heard many success stories from using local models yet
Also, small LLMs are prone to stopping before completion, throwing errors, and getting stuck in loops. Is this factored into the design of the tool? I am not sure.
edit: spellcheck
Totally right about small LLMs btw, that's why we trained this on real agent sessions where we forced it to use different models. If the routing model sees that small models can't handle a certain type of task, it won't assign them that kind of task. (Also, as a fallback we have some guardrails that have a bigger model come in to "rescue" a smaller model if it gets stuck.)
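To make the "rescue" idea concrete, here's a rough sketch of what such a guardrail could look like. This is not the project's actual code: `looks_stuck`, `run_with_rescue`, the `DONE` completion marker, and the "same output three times in a row" heuristic are all made up for illustration, and the two models are just plain callables.

```python
from typing import Callable, List, Tuple


def looks_stuck(outputs: List[str], max_repeats: int = 3) -> bool:
    """Hypothetical heuristic: the last output is empty, or the same
    output has been produced max_repeats times in a row (a loop)."""
    if not outputs:
        return False
    if not outputs[-1].strip():
        return True
    tail = outputs[-max_repeats:]
    return len(tail) == max_repeats and len(set(tail)) == 1


def run_with_rescue(
    task: str,
    small: Callable[[str], str],
    big: Callable[[str], str],
    max_steps: int = 10,
) -> Tuple[str, str]:
    """Try the small model first; hand the task to the big model if the
    small one errors, loops, stops early, or runs out of steps.
    Returns (which_model_finished, final_output)."""
    outputs: List[str] = []
    for _ in range(max_steps):
        try:
            out = small(task)
        except Exception:
            # Small model threw an error: escalate.
            return "big", big(task)
        outputs.append(out)
        if looks_stuck(outputs):
            # Empty or repeated output: escalate.
            return "big", big(task)
        if out.endswith("DONE"):  # assumed completion marker
            return "small", out
    # Step budget exhausted without finishing: escalate.
    return "big", big(task)
```

A real guardrail would presumably pass the small model's partial work to the bigger one instead of restarting from the bare task, but the control flow would be similar.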
1. https://github.com/instavm/murmur - Murmur
Throughput also goes up somewhat, since the requests are spread across different providers.
In practice, lots of ppl are using this to make their Claude sub limits go further!
This is probably not a very effective way of marketing imo. At least, it turns me completely off.
So our routing is cache-aware. It will have a much higher threshold to switch from one model to another if there's already some cache for the first model. Experimentally this solves the problem (like I said we've saved 40% ourselves vs. what we would have otherwise paid).
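For anyone who wants to see why the cache changes the math, here's a toy sketch of a cache-aware switching rule. It is only an illustration under stated assumptions, not how the router actually decides: `ModelPrice`, `input_cost`, `should_switch`, the 10% `margin`, and all the prices are hypothetical, and it only looks at input token cost (no output tokens, no quality estimate).

```python
from dataclasses import dataclass


@dataclass
class ModelPrice:
    # Hypothetical prices in USD per million input tokens,
    # not real provider pricing.
    input_per_mtok: float
    cached_input_per_mtok: float


def input_cost(price: ModelPrice, prompt_tokens: int, cached_tokens: int) -> float:
    """Input cost of one request, with cached_tokens billed at the cached rate."""
    uncached = prompt_tokens - cached_tokens
    total = (uncached * price.input_per_mtok
             + cached_tokens * price.cached_input_per_mtok)
    return total / 1_000_000


def should_switch(current: ModelPrice, candidate: ModelPrice,
                  prompt_tokens: int, cached_tokens: int,
                  margin: float = 0.1) -> bool:
    """Switch only if the candidate, starting with a cold cache, is still
    cheaper than staying on the current model with its warm cache,
    by at least `margin` (an assumed safety factor)."""
    stay = input_cost(current, prompt_tokens, cached_tokens)
    move = input_cost(candidate, prompt_tokens, 0)  # switching loses the cache
    return move < stay * (1 - margin)


expensive = ModelPrice(input_per_mtok=3.0, cached_input_per_mtok=0.3)
cheap = ModelPrice(input_per_mtok=1.0, cached_input_per_mtok=0.1)

# No cache yet: the cheaper model wins easily.
print(should_switch(expensive, cheap, prompt_tokens=100_000, cached_tokens=0))
# 90% of the prompt is already cached on the expensive model: stay put.
print(should_switch(expensive, cheap, prompt_tokens=100_000, cached_tokens=90_000))
```

With these made-up prices, the nominally cheaper model costs more for that one request once most of the context is cached on the current one, which is the effect a stateless router can't see.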
Do people voluntarily use these proxies/routers, knowing their prompts, outputs and code will be seen by other people?
I get it might be ok for personal projects, but for anything that makes money and is part of a business... this must be a big no-no?
But of course, since the source is available, you can also run it locally or self-host.