On every turn of a conversation, the LLM receives the whole conversation so far. Caching complicates the picture, but not by a huge amount. That's why a short question at the end of a long conversation chews through tokens faster than it would in a fresh session.

So a conversation that starts with one model and then switches to another would presumably have to send the whole conversation plus the new question to the second model, which defeats the purpose of splitting traffic. So you're not wrong to question how this actually improves things for anything other than short sessions, and for those you could just pick the model yourself if it's a small problem.
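To put rough numbers on the "short question at the end of a long conversation" point, here is a minimal sketch that ignores caching. The function name and the token counts are made up for illustration; real counts depend on the tokenizer and the provider.

```python
def input_tokens_per_turn(turn_sizes):
    """Input tokens sent on each turn when the full history is resent.

    turn_sizes: list of (user_tokens, reply_tokens), one pair per turn.
    """
    sent = []
    history = 0  # tokens accumulated from all earlier turns
    for user_tokens, reply_tokens in turn_sizes:
        prompt = history + user_tokens  # whole history plus the new message
        sent.append(prompt)
        history = prompt + reply_tokens  # the reply joins the history too
    return sent


# Five long turns, then one short 20-token question (illustrative sizes).
turns = [(200, 800)] * 5 + [(20, 100)]
per_turn = input_tokens_per_turn(turns)
print(per_turn)       # [200, 1200, 2200, 3200, 4200, 5020]
print(sum(per_turn))  # 16020
```

The 20-token question costs 5020 input tokens here, versus 20 in a fresh session. A model switch mid-conversation pays that full prompt with no warm cache.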

sending the whole conversation to a cheap model could still be cheaper than sending just the latest message to the expensive one

you could even take this into account automatically to help decide
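a rough sketch of what that automatic check might look like. all prices are hypothetical placeholders in USD per million input tokens, output tokens are ignored, and it assumes the expensive model has the history cached while the cheap one starts cold:

```python
def cost_usd(tokens, price_per_million):
    """Cost of `tokens` input tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


def cheaper_route(history_tokens, new_tokens,
                  cheap_price, expensive_price, expensive_cached_price):
    """Pick the cheaper option for the next turn.

    cheap:     full history + new message, no cache (cold start).
    expensive: history billed at the cached rate, new message at full rate.
    """
    cheap = cost_usd(history_tokens + new_tokens, cheap_price)
    expensive = (cost_usd(history_tokens, expensive_cached_price)
                 + cost_usd(new_tokens, expensive_price))
    return "cheap" if cheap < expensive else "expensive"


# Hypothetical prices: cheap 0.30, expensive 10.00, expensive cached 1.00.
print(cheaper_route(50_000, 200, 0.30, 10.00, 1.00))  # cheap
# With a pricier "cheap" model and a deeper cache discount it flips.
print(cheaper_route(50_000, 200, 2.00, 10.00, 0.50))  # expensive
```

so the answer depends on the price ratio and the cache discount, which is exactly the kind of thing a router could compute per request.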

Here's a use case: You want to extend the GPT 5.5 quota in your Codex subscription by routing some % of requests to DeepSeek V4 Pro. A router needs to figure out which requests to send where, based on how difficult each one is.

Another use case: You have two models on your local device. One is large and fairly powerful but slow, the other is smaller, faster and good at tool calls and chat, but not great for writing and reviewing code. If you route between them per request, you can get a better developer experience without giving up output quality.

The linked repo aims to help you achieve these things, as do I with the role-model router and protocol that I linked in another comment.
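As a toy illustration of the second use case (not how the linked repo or my router actually works), per-request routing could be as simple as a lookup on the kind of task. The task labels and model names below are hypothetical, and it assumes something upstream has already classified the request:

```python
# Hypothetical task labels; a real router would need a classifier for these.
SMALL_MODEL_TASKS = {"tool_call", "chat"}
LARGE_MODEL_TASKS = {"write_code", "review_code"}


def pick_model(task_kind):
    """Return the (hypothetical) local model name for a request."""
    if task_kind in SMALL_MODEL_TASKS:
        return "small-fast"
    # Code work, and anything unrecognized, goes to the more capable model.
    return "large-slow"


for kind in ["chat", "tool_call", "review_code", "something_else"]:
    print(kind, "->", pick_model(kind))
```

Defaulting unknown requests to the large model trades speed for safety; the opposite default would be reasonable if latency matters more.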

to add to that: for example, at the end of implementing a task, when the model runs the formatters, linters and tests, then commits and pushes, all of that could be done by a very cheap model, switching back to the main model only if something fails hard

there are some cache-busting considerations, but solvable
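a minimal sketch of that wrap-up flow, with made-up names. `run_step` is a hypothetical callback that asks the given model to perform one step and returns True on success; here it is faked so the example runs on its own:

```python
def finish_task(steps, run_step):
    """Run wrap-up steps on the cheap model, escalating once on failure.

    Returns a log of (step, model, ok) tuples. After an escalation the
    remaining steps stay on the main model.
    """
    model = "cheap"
    log = []
    for step in steps:
        ok = run_step(model, step)
        log.append((step, model, ok))
        if not ok and model == "cheap":
            model = "main"  # hard failure: hand over to the main model
            ok = run_step(model, step)
            log.append((step, model, ok))
        if not ok:
            break  # even the main model failed, stop here
    return log


def fake_run_step(model, step):
    """Stand-in for a real model call: the cheap model fails on tests."""
    return not (model == "cheap" and step == "test")


steps = ["format", "lint", "test", "commit", "push"]
for entry in finish_task(steps, fake_run_step):
    print(entry)
```

staying on the main model after an escalation is one choice among several; dropping back to the cheap model afterwards saves more but means another cold or partially warm cache.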

indeed. i also wrote elsewhere that the current ideal number of models in a pool is probably 2. if you route between two, both will have a warm cache, though not the full cache at all times, so you lose a little but not much.

LLMs have no state. Nothing is remembered, nothing new is learned. The same input always gives the same output (unless the seed is randomized). So during a chat it doesn't matter if a different provider is used on every turn.

I'm not sure whether the output of easy commands like "summarize this" is added back to the context. I always assumed it lives in a separate UI layer?