undefined

points

[-]

Every turn of a conversation with an LLM is getting the whole conversation. Caching complicates the picture, but not by a huge amount. That's why a short question at the end of a long conversation chews tokens faster than it would in a fresh session.

So, a conversation that's ongoing with one model then switching to another would presumably send the whole conversation and the new question. Which defeats the purpose of splitting traffic...so, you're not wrong to question how this actually improves things for anything other than short sessions, which you could choose your own model for if it's a small problem.

by nok22kon8 hours ago|

parent|

[-]

sending the whole conversation to a cheap model could still be cheaper than sending just the latest message to the expensive one

you could even take this into account automatically to help decide

by try-working9 hours ago|

prev|

[-]

Here's a use case: You want to extend the GPT 5.5 quota in you Codex subscription by routing some % of requests to DeepSeek V4 Pro. A router needs to figure out which requests to route where, for the appropriate difficulty level.

Another use case: You have two models on your local device. One is large and fairly powerful but low, the other is smaller, faster and good at tool calls and chat, but not great for writing and reviewing code. If you route between them per request, you can get a better developer experience with preserved performance.

The linked repo aims to help you achieve these things, as do I with the role-model router and protocol that I linked in another comment.

by nok22kon8 hours ago|

parent|

[-]

to add to that, for example at the end of implementing a task, where the model runs the formatters, linters, tests, commits, pushes, this could be done by a very cheap model, and only switch to the main model again if something fails hard

there are some cache-busting considerations, but solvable

by try-working8 hours ago|

parent|

[-]

indeed. i also wrote elsewhere that the current ideal number of models in a pool is probably 2, so if you route between two both will have warm cache, though not the full cache at all times, so you lose a little but not much.

by 8 hours ago|

prev|

[-]

deleted

by holoduke8 hours ago|

prev|

[-]

LLMs have no state. There is nothing remembered, nothing new learned. It's the same input , the same output always (unless seeding is randomized). So during a chat it won't matter if every chat turn a different provider is used.

by spiderfarmer10 hours ago|

prev|

[-]

I'm not sure if output of easy commands like "summarize this" are added back to the context? I always assumed they are in a separate UI layer?