Wayfinder Router: deterministic routing of queries between local and hosted LLM

upvote

Wayfinder Router: deterministic routing of queries between local and hosted LLM

(github.com)

103 points

by handfuloflight13 hours ago |

upvote

by josalhor10 hours ago|

[-]

We need LLM query routing at the OS level like Mobile data. I know it will sound crazy but hear me out. I think about this AI inference as infrastructure. I do not want to pay for it on every app I use it on. I do not think "I have to pay the mobile data of youtube, and the mobile data of whatsapp etc.". I pay Mobile data infrastructure and let my device route it appropiately. In fact, if we ever go the local llm route, you could have LLM capabilities without having access to the internet (or local LAN), and your OS/computer is the only one capable of doing that routing for you.

reply

upvote

by solenoid09379 hours ago|

[-]

It doesn't sound crazy at all, this seems almost obvious. The OS should provide a chat completions server and the user should be able to select the underlying LLM's server. This should be just like selecting a default search engine or browser.

Hopefully the EU forces US tech giants to do this. God knows Apple and Google won't do this on their own. They gotta get that sweet default provider revenue.

reply

upvote

by number653 minutes ago|

[-]

Apple told EU Citizens thats why they cant have Siri on their Phone as AI; they would have to provide an Interface where you can plug in your own LLM of your chosing.

reply

upvote

by KronisLV6 hours ago|

[-]

> It doesn't sound crazy at all, this seems almost obvious. The OS should provide a chat completions server and the user should be able to select the underlying LLM's server. This should be just like selecting a default search engine or browser.

I wonder why this hasn’t happened yet. If Microsoft wants to have a Copilot button and AI investments are all the rage now, surely anything to make integrating with them would be good for keeping the hype cycle alive for longer?

reply

upvote

by dewey4 hours ago|

[-]

Because it’s still pretty early in this journey and there’s some exclusivity deals to be made.

reply

upvote

by utopiah6 hours ago|

[-]

Honestly I don't get the point but if you want to explore that, both on desktop, mobile or headless server Linux allows you to try it.

You can run ollama with whatever you want on a Debian in literally minutes. You can even do that within a virtual machine using e.g. QEMU, so that you can do all the tests you need risk free.

Again I don't understand what that would enable that can't be done today but it's perfectly fine, you can try today anyway, no need to ask permission to anyone.

reply

upvote

by josalhor5 hours ago|

[-]

No, what I am saying simply does not exist yet.

I am saying I want my OS to expose APIs like it does for the disk or the network for AI. And I want my apps to be able to use those APIs.

I want my backend LLMs to be able to change on a whim. Imagine an Android app consuming from these LLMs. Maybe I am outside and it is making queries to Gemini. And maybe I get home and now it makes queries to my local llm, almost like connecting to local Wifi.

What I am saying does not exist on many levels:

- Agreed upon APIs for this don't think exist (in text maybe, but not in image/sound/video).

- OSs do not expose this (I am not talking manually configured user space stuff here).

- I see a world where your Network provider bundles "calls + data plan + AI tokens". But not only are the offerings for these not standardized, in order to even reach that point we would need to standardize the offerings. How do you compare intelligence among models? How do you compare cost?

- The apps need to start adopting this model

The tech is here, the ecosystem is not.

reply

upvote

by bglusman5 hours ago|

[-]

Well… it doesn’t exist FROM APPLE or MICROSOFT or GOOGLE at their shipped OS Level, but… fundamentally this isn’t a “true OS” level feature you’re asking for, it’s something you think the OS products should bake in, and you might be right! But I think the parents post is suggesting YOU CAN BUILD a prototype of what you want, how it should work, on Linux…

I have a project somewhat close to this I’ve put on pause the last month or so, partly because I’m not sure how useful it is or where to take next, but I may incorporate Wayfinder into it as a next step to improve its capabilities, as part of what it is a model gateway/router that this feels like could make more powerful/flexible in its decision making. I can’t decide if what I’m building is mostly a model recipe cookbook/platform, or a debugging tool, or both or something else at the moment, but, it can do most of that… maybe it’s part of what you want, if you figure that out better? feedback welcome! https://wardwright.dev/ https://github.com/bglusman/wardwright

reply

upvote

by josalhor3 hours ago|

[-]

> Well… it doesn’t exist FROM APPLE or MICROSOFT or GOOGLE at their shipped OS

What I am saying does not exist period. What I am saying is that there isn't a proper abstraction that helps the ecosystem build upon it.

> But I think the parents post is suggesting YOU CAN BUILD a prototype of what you want, how it should work, on Linux

I mean, yes. But me saying "this does not exist" and someone saying "but you can build it" does not take away from the fact that... Yeah, it doesn't exit :).

And also, no, I cannot build it, at least not alone. Because I want apps to eventually build upon my abstractions. This would require a good set of millions, of which the technical development would be a small part. The coordination, contracts, API definitions, even marketing, etc would be the majority.

I am saying something that Google, Telefonica, Microsoft etc could do.

reply

upvote

by try-working1 hours ago|

[-]

Exactly. That's why I built the role-model protocol, the pi-role-model extension so that Pi can tell the router where its requests should go, and the reference router implementation: https://news.ycombinator.com/item?id=48706181

reply

upvote

by idiotsecant4 hours ago|

[-]

Why do we need API endpoints? We have the best API there is - the CLI

reply

upvote

by josalhor3 hours ago|

[-]

Querying a CLI is also querying an API. I never said API endpoint. An API can be a Java Interface, a CLI, an endpoint etc.

reply

upvote

by throwaway8943454 hours ago|

[-]

I mean, the reason mobile data is part of the OS is because the antenna is hardware that must be shared across processes. Chat completions is just a network call like anything else—it’s already available to every app; they don’t need to pay separately (they can use the same account), they just pass their API key over the network to the completions server. What am I missing?

reply

upvote

by josalhor3 hours ago|

[-]

> Chat completions is just a network call like anything else

But what if Chat completion was resolved locally with hardware? Or what if I want my OS to coordinate Chat completions locally and, if my hardware is overwhelmed, send some to network?

You do have a valid point, yes, that what I am saying, without support for local hardware could be done with a sort of Open Router equivalent.

> they don’t need to pay separately (they can use the same account), they just pass their API key over the network to the completions server

That I would be conformable putting what I am saying on my parents phone. I do no trust my parents to manage API keys. What I am saying is an ecosystem thing, not only a low level thing

reply

upvote

by billiam50 minutes ago|

[-]

I sure hope the community is working on these APIs now for Linux, with pressure to come on M$ and Apple.

reply

upvote

by dd8601fn11 hours ago|

[-]

It's funny how much that first paragraph is Claude's voice. I don't know how it got trained so hard to use, "the shape of" for everything.

reply

upvote

by paradox46010 hours ago|

[-]

Loads of ed sheeran in the training data?

reply

upvote

by ninjalanternshk1 hours ago|

[-]

Thank you! I thought it was just me.

reply

upvote

by ifdefdebug7 hours ago|

[-]

Do you want the honest answer?

reply

upvote

by hmokiguess4 hours ago|

[-]

There's a hidden tax with routing this way, the original model loses context of what was done and either performs a regression or hallucinates.

I think this sort of behaviour started happening more frequently as agentic/ai programming became more often.

Back in the days (lol, reads like a long time ago but that's probably a few months?), you would not say "edit this typo", you would just open the file and not be lazy, and the harness would detect a user change and ground itself.

I feel like now, when I edit outside the AI flow, it goes and introduces a regression or gets lots thinking it didn't do that and something must have gone wrong.

reply

upvote

by kstenerud8 hours ago|

[-]

I'm not sure I understand what this is trying to solve?

If a prompt I give routes to one model, and then another prompt to another model, how does one tie the context together such that the next model knows what's going on?

Otherwise this would only be useful for one-off prompts as far as I can tell.

And if it did keep a context to be passed around, it would always land hot (not in the cache).

reply

upvote

by SwellJoe8 hours ago|

[-]

Every turn of a conversation with an LLM is getting the whole conversation. Caching complicates the picture, but not by a huge amount. That's why a short question at the end of a long conversation chews tokens faster than it would in a fresh session.

So, a conversation that's ongoing with one model then switching to another would presumably send the whole conversation and the new question. Which defeats the purpose of splitting traffic...so, you're not wrong to question how this actually improves things for anything other than short sessions, which you could choose your own model for if it's a small problem.

reply

upvote

by nok22kon6 hours ago|

[-]

sending the whole conversation to a cheap model could still be cheaper than sending just the latest message to the expensive one

you could even take this into account automatically to help decide

reply

upvote

by try-working7 hours ago|

[-]

Here's a use case: You want to extend the GPT 5.5 quota in you Codex subscription by routing some % of requests to DeepSeek V4 Pro. A router needs to figure out which requests to route where, for the appropriate difficulty level.

Another use case: You have two models on your local device. One is large and fairly powerful but low, the other is smaller, faster and good at tool calls and chat, but not great for writing and reviewing code. If you route between them per request, you can get a better developer experience with preserved performance.

The linked repo aims to help you achieve these things, as do I with the role-model router and protocol that I linked in another comment.

reply

upvote

by nok22kon6 hours ago|

[-]

to add to that, for example at the end of implementing a task, where the model runs the formatters, linters, tests, commits, pushes, this could be done by a very cheap model, and only switch to the main model again if something fails hard

there are some cache-busting considerations, but solvable

reply

upvote

by try-working6 hours ago|

[-]

indeed. i also wrote elsewhere that the current ideal number of models in a pool is probably 2, so if you route between two both will have warm cache, though not the full cache at all times, so you lose a little but not much.

reply

upvote

by 7 hours ago|

[-]

deleted

reply

upvote

by holoduke6 hours ago|

[-]

LLMs have no state. There is nothing remembered, nothing new learned. It's the same input , the same output always (unless seeding is randomized). So during a chat it won't matter if every chat turn a different provider is used.

reply

upvote

by spiderfarmer8 hours ago|

[-]

I'm not sure if output of easy commands like "summarize this" are added back to the context? I always assumed they are in a separate UI layer?

reply

upvote

by nok22kon6 hours ago|

[-]

we could use some composability.

today any kind of routing requires implementing an http proxy to put in the middle

ideally harnesses would support a routing plugin which receives the new whole context and returns just where to send it, and the harness does that. no http proxy. obviously some complications if you want to route from codex to anthropic or openrouter.

but we need to decouple the context building and routing decision from the actual http requests sending, we need to be able to insert "context/routing plugins" in the chain

reply

upvote

by dd8601fn3 hours ago|

[-]

Mine does this. Only I don’t use the whole context with the router because that’s wildly resource intensive and slow.

Then, a bit like open router, it does a classifier job with a fast model to choose which one should process the turn.

In my case I usually don’t do local vs remote… although it can. Now I use it for thinking vs no-think against my preferred local model, which is a huge time saver even with the added classification step.

reply

upvote

by cyanydeez6 hours ago|

[-]

I'm still waiting for an isolated protocol so we don't have to run the hanress directly on any of the code base's infrastructure. Something as simple as piping everything into and out of an ssh shell would be better than anything I've tested so far.

reply

upvote

by try-working4 hours ago|

[-]

i've created the protocol, role-model: https://github.com/try-works/role-model

reply

upvote

by niles4 hours ago|

[-]

Great name, but ironically hard to reason about from a role perspective, at least at the read me.

Does this interfere with cache hits? Could a single conversation or task span multiple roles?

Why are you building this? Does this maximize my toxen value by saving the hard tasks for the hard model? Does it maximize cache hits as part of its scoring? Does it help agents develop a specialist mindset? Are you anticipating users will have many local models hot, or is this also a model load/unload controller?

reply

upvote

by try-working3 hours ago|

[-]

I'm building this to achieve a state where I can, as a user and on my own device, decide that certain type of workloads should be handled by my Qwen model and keep the data on my device, while other workloads should be handled by more capable models.

For this we don't just need a router, because the information to make detailed and accurate routing decisions currently doesn't exist. And there are no standards but every lab and maybe even inference providers have their own way of implementing reasoning, chat templates, cache, tool use and so on. All issues that make models non-interoperable.

What we need is applications that clearly specify their requests so they can be accurately routed to a provider, whether local or remote. And for that they need to use a standard protocol for model requests and intent.

I wrote a longer piece here: https://news.ycombinator.com/item?id=48706181

reply

upvote

by darepublic5 hours ago|

[-]

Some kind of routing prompts to different models does make sense. But the usecase of saving money on simple prompt.. I think that has only a slight benefit. Fix my typo doesn't use many tokens anyhow.. also model switching still requires carrying over context so it does have some overhead right.

reply

upvote

by _pdp_9 hours ago|

[-]

There are so many proxies like this now but I can tell you from first hand experience this is not going to work. You cannot just route away from a situation at such a high level especially when we are talking about models that are quite different in behaviour, with different context windows and tuned to different tool uses. The harness is doing all kind of funky things to compensate for issues (like tool call truncation) that a proxy that routes dynamically like this will work against the very same strategies that make the harness work.

Interesting concept, work in theory, but I cannot see this being part of larger system.

reply

upvote

by Semaphor7 hours ago|

[-]

This is not choosing between different models, really. You should check the (interesting, yet sadly very slop-padded) readme. It’s about trying to make a binary decision: Is this a hard or easy question, and about making that decision extremely fast. They suggest putting another router that chooses the model behind it. I’m not sure how well it would work, but the idea is interesting and different than other routers.

reply

upvote

by try-working12 hours ago|

[-]

Love to see local/cloud routing explicitly supported.

I'm building another router for routing between local and remote models, ShowHN coming up later today. Here's a sneak preview of the github: https://github.com/try-works/role-model

reply

upvote

by try-working7 hours ago|

[-]

Posted my ShowHN: https://news.ycombinator.com/item?id=48706181

reply

upvote

by stanpinte8 hours ago|

[-]

We are developing many applications in my company, some of them safety critical. A natural routing way could happen for certain phases of development, and interfaces via git. One agent works on branch a and is responsible for brainstorm planning specs, and the other is responsible code and tests. The first agent creates tickets for the second one and the second one consumes these. This works with today’s standard harness.

reply

upvote

by JSR_FDED10 hours ago|

[-]

Slight tangent, but “Wayfinder sits behind whatever OpenAI-compatible client you already use” reminds me that descriptions of where proxies sit in the information flow always seem so arbitrary to me:

  - “after the client”
  - “reverse proxy” (in front  of servers)
  - “proxy” (in front of client)

I always have to look this up, surely there must be a standardized way to describe this?

reply

upvote

by parasti9 hours ago|

[-]

"after the client" and "in front of client" can mean the same thing depending on your viewpoint.

reply

upvote

by JSR_FDED9 hours ago|

[-]

Exactly, that’s my point

reply

upvote

by mrkn15 hours ago|

[-]

Has anyone tried the others listed? Any feedback?

reply

upvote

by throwawayk7h11 hours ago|

[-]

It'd be nice to just have a command prefix e.g.

/local fix my typo

reply

upvote

by girvo10 hours ago|

[-]

That’s what I did with Pi, super simple :)

reply

upvote

by ListeningPie10 hours ago|

[-]

can you send to multiple LLMs to compare responses? From that create a heuristic of which LLM gets what.

reply

upvote

by api7 hours ago|

[-]

I do this manually with a desktop app called BoltAI that lets you continue the whole conversation at your LLM of choice.

reply

upvote

by quijoteuniv11 hours ago|

[-]

This is the way!

reply

upvote

by tcballard11 hours ago|

[-]

I like to think so!

reply

upvote

by harvardhan15 hours ago|

[-]

axa

reply

upvote

by 5 hours ago|

[-]

deleted

reply

upvote

by harvardhan15 hours ago|

[-]

wfdwcZz

reply

upvote

by madikz4 hours ago|

[-]

[flagged]

reply

upvote

by pennylaw9461 hours ago|

[-]

[flagged]

reply

upvote

by systemalice3 hours ago|

[-]

[flagged]

reply

upvote

by kumiko_studio7 hours ago|

[-]

[flagged]

reply

upvote

by tcballard12 hours ago|

[-]

[dead]

reply

upvote

by SnipVote4 hours ago|

[-]

[dead]

reply

upvote

by niemandhier10 hours ago|

[-]

[dead]

reply

upvote

by tcballard12 hours ago|

[-]

[flagged]

reply

upvote

by terekhindc9 hours ago|

[-]

[dead]

reply

upvote

by kevinten1010 hours ago|

[-]

[dead]

reply