Hopefully the EU forces US tech giants to do this. God knows Apple and Google won't do this on their own. They gotta get that sweet default provider revenue.
I wonder why this hasn’t happened yet. If Microsoft wants to have a Copilot button and AI investments are all the rage now, surely anything that makes integrating with them easier would help keep the hype cycle alive for longer?
You can run Ollama with whatever model you want on Debian in literally minutes. You can even do that inside a virtual machine using e.g. QEMU, so you can run all the tests you need risk-free.
Again, I don't understand what that would enable that can't be done today, but that's perfectly fine: you can try it today anyway, no need to ask anyone's permission.
I am saying I want my OS to expose APIs for AI, like it does for the disk or the network. And I want my apps to be able to use those APIs.
I want my backend LLMs to be able to change on a whim. Imagine an Android app consuming these LLMs. Maybe I am outside and it is sending queries to Gemini. Then I get home and now it sends queries to my local LLM, almost like connecting to local Wi-Fi.
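A minimal sketch of what such an OS-level API could look like, in Python, assuming a hypothetical `SystemAIService` that apps call instead of talking to a specific provider. The OS maps the current network to a backend; `EchoBackend` and the network IDs are made up and stand in for a real remote API or local server.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class CompletionBackend(Protocol):
    """Anything that can turn a prompt into a completion."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoBackend:
    """Stand-in backend; a real one would call a remote API or a local server."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class SystemAIService:
    """Hypothetical OS-level AI service: apps call complete(), the OS picks the backend."""

    def __init__(self, default: CompletionBackend, current_network: Callable[[], str]):
        self._default = default
        self._current_network = current_network
        self._by_network: dict[str, CompletionBackend] = {}

    def register(self, network_id: str, backend: CompletionBackend) -> None:
        # e.g. "use my local LLM whenever I'm on my home Wi-Fi"
        self._by_network[network_id] = backend

    def active_backend(self) -> CompletionBackend:
        # Fall back to the default (remote) backend on unknown networks.
        return self._by_network.get(self._current_network(), self._default)

    def complete(self, prompt: str) -> str:
        # The calling app never knows which backend answered.
        return self.active_backend().complete(prompt)


# Usage: simulate moving from a cafe network to home Wi-Fi.
network = {"id": "cafe-guest"}
service = SystemAIService(EchoBackend("remote"), lambda: network["id"])
service.register("home-wifi", EchoBackend("local"))
print(service.complete("hi"))  # [remote] hi
network["id"] = "home-wifi"
print(service.complete("hi"))  # [local] hi
```

The app code is identical in both cases; only the OS-side mapping changes, which is the "like connecting to Wi-Fi" behaviour.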
What I am saying does not exist on many levels:
- Agreed-upon APIs for this don't exist, I think (maybe for text, but not for image/sound/video).
- OSs do not expose this (I am not talking about manually configured user-space stuff here).
- I see a world where your network provider bundles "calls + data plan + AI tokens". But these offerings aren't standardized, and to even reach that point we would need to standardize them. How do you compare intelligence among models? How do you compare cost?
- The apps need to start adopting this model.
The tech is here, the ecosystem is not.
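Cost, at least, can be made comparable once offerings are described in a common shape. A rough sketch, assuming a hypothetical `Offering` record (flat fee, included tokens, per-million overage) and made-up prices; it says nothing about comparing intelligence, which has no agreed yardstick.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    """Hypothetical normalized description of an AI offering."""

    name: str
    monthly_fee: float          # flat fee, e.g. bundled into a phone plan
    included_tokens: int        # tokens covered by the flat fee
    overage_per_million: float  # price per 1M tokens beyond the allowance


def monthly_cost(offer: Offering, tokens_used: int) -> float:
    """What one month at a given usage level costs under an offering."""
    extra_tokens = max(0, tokens_used - offer.included_tokens)
    return offer.monthly_fee + extra_tokens / 1_000_000 * offer.overage_per_million


def cheapest(offers: list[Offering], tokens_used: int) -> Offering:
    return min(offers, key=lambda o: monthly_cost(o, tokens_used))


# Made-up prices, purely for illustration.
pay_as_you_go = Offering("pay-as-you-go", 0.0, 0, 2.0)
carrier_bundle = Offering("carrier bundle", 5.0, 5_000_000, 1.0)
print(cheapest([pay_as_you_go, carrier_bundle], 1_000_000).name)   # pay-as-you-go
print(cheapest([pay_as_you_go, carrier_bundle], 10_000_000).name)  # carrier bundle
```

The answer flips with usage, which is why a common description format matters more than any single price.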
I have a project somewhat close to this that I’ve put on pause for the last month or so, partly because I’m not sure how useful it is or where to take it next. I may incorporate Wayfinder into it as a next step to improve its capabilities: part of what it is is a model gateway/router, and this feels like it could make its decision-making more powerful and flexible. At the moment I can’t decide whether what I’m building is mostly a model recipe cookbook/platform, a debugging tool, both, or something else, but it can do most of that… Maybe it’s part of what you want, if you figure that out better? Feedback welcome! https://wardwright.dev/ https://github.com/bglusman/wardwright
What I am saying does not exist, period. What I am saying is that there isn't a proper abstraction that helps the ecosystem build upon it.
> But I think the parent's post is suggesting YOU CAN BUILD a prototype of what you want, how it should work, on Linux
I mean, yes. But me saying "this does not exist" and someone saying "but you can build it" does not take away from the fact that... yeah, it doesn't exist :).
And also, no, I cannot build it, at least not alone, because I want apps to eventually build upon my abstractions. This would require several million, of which the technical development would be a small part. The coordination, contracts, API definitions, even marketing, etc. would be the majority.
I am describing something that Google, Telefonica, Microsoft, etc. could do.
But what if chat completion were resolved locally, in hardware? Or what if I want my OS to coordinate chat completions locally and, if my hardware is overwhelmed, send some to the network?
You do have a valid point: without support for local hardware, what I am describing could be done with a sort of OpenRouter equivalent.
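One way an OS could do local-first with overflow, sketched in Python: a fixed number of local "slots" guarded by a semaphore, and any request that can't get a slot immediately goes to the network. `OverflowDispatcher`, `local_slots`, and the plain-callable backends are assumptions, not an existing API; real capacity would depend on VRAM, batch size, and so on.

```python
import threading
from typing import Callable

Completion = Callable[[str], str]  # prompt in, completion text out


class OverflowDispatcher:
    """Run completions locally while there is capacity; spill the rest to a remote backend."""

    def __init__(self, local: Completion, remote: Completion, local_slots: int):
        self._local = local
        self._remote = remote
        # Each slot is one request the local hardware may run at the same time.
        self._slots = threading.BoundedSemaphore(local_slots)

    def complete(self, prompt: str) -> str:
        # Non-blocking acquire: if every local slot is busy, don't queue, go remote.
        if self._slots.acquire(blocking=False):
            try:
                return self._local(prompt)
            finally:
                self._slots.release()
        return self._remote(prompt)
```

Not waiting for a local slot trades some cost for latency; a variant could wait briefly (`acquire(timeout=...)`) before spilling over.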
> they don’t need to pay separately (they can use the same account), they just pass their API key over the network to the completions server
What I am describing is something I would be comfortable putting on my parents' phone. I do not trust my parents to manage API keys. What I am describing is an ecosystem thing, not only a low-level thing.
I think this sort of behaviour started happening more frequently as agentic/AI programming became more common.
Back in the day (lol, reads like a long time ago, but that's probably a few months?), you would not say "edit this typo"; you would just open the file and not be lazy, and the harness would detect a user change and ground itself.
I feel like now, when I edit outside the AI flow, it goes and introduces a regression, or gets lost thinking it didn't do that and something must have gone wrong.
If a prompt I give routes to one model, and then another prompt to another model, how does one tie the context together such that the next model knows what's going on?
Otherwise this would only be useful for one-off prompts as far as I can tell.
And if it did keep a context to be passed around, it would always land cold (not in the cache).
So a conversation that's ongoing with one model and then switches to another would presumably send the whole conversation plus the new question, which defeats the purpose of splitting traffic. So you're not wrong to question how this actually improves things for anything other than short sessions, and for a small problem you could just choose your own model anyway.
You could even take this into account automatically to help decide.
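A rough sketch of taking the cache into account when deciding whether to switch models. It assumes prompt caching behaves like prefix reuse (tokens matching the start of the last prompt sent to that model are "warm") and represents tokens as list items; `PrefixCacheTracker`, `choose_model`, and `max_extra_uncached` are hypothetical names.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCacheTracker:
    """Crude model of prompt caching: each model 'remembers' the last prompt sent to it."""

    def __init__(self) -> None:
        self._last_sent: dict[str, list[str]] = {}

    def uncached(self, model: str, prompt: list[str]) -> int:
        # Tokens the model would have to process from scratch (the cold part).
        cached = self._last_sent.get(model, [])
        return len(prompt) - shared_prefix_len(cached, prompt)

    def record(self, model: str, prompt: list[str]) -> None:
        self._last_sent[model] = list(prompt)


def choose_model(tracker: PrefixCacheTracker, prompt: list[str],
                 preferred: str, current: str, max_extra_uncached: int) -> str:
    """Switch to the preferred model only if the extra cold tokens are affordable."""
    if preferred == current:
        return current
    extra = tracker.uncached(preferred, prompt) - tracker.uncached(current, prompt)
    return preferred if extra <= max_extra_uncached else current


# Usage: the conversation so far went to "big"; "small" has never seen it.
tracker = PrefixCacheTracker()
history = ["system", "user-1", "assistant-1"]
tracker.record("big", history)
prompt = history + ["user-2"]
print(choose_model(tracker, prompt, "small", "big", max_extra_uncached=2))   # big
print(choose_model(tracker, prompt, "small", "big", max_extra_uncached=10))  # small
```

The longer the conversation, the larger the cold-token penalty for switching, which matches the intuition that routing mostly pays off for short sessions.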
Another use case: you have two models on your local device. One is large and fairly powerful but slow; the other is smaller, faster, and good at tool calls and chat, but not great for writing and reviewing code. If you route between them per request, you can get a better developer experience while preserving performance.
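Per-request routing between two local models can start as simply as this. The model names and the keyword heuristic are placeholders; a real router would use something better than substring matching.

```python
# Hypothetical names for the two locally loaded models.
BIG_SLOW = "big-coder"
SMALL_FAST = "small-chat"

# Crude keyword stand-in for "is this about writing or reviewing code?"
CODE_HINTS = ("def ", "class ", "diff", "refactor", "review", "traceback", "unit test")


def route(prompt: str) -> str:
    """Pick a model for a single request; each request is routed independently."""
    text = prompt.lower()
    if any(hint in text for hint in CODE_HINTS):
        return BIG_SLOW    # quality matters more than latency here
    return SMALL_FAST      # chat and tool calls: keep the loop snappy


print(route("Please review this diff"))          # big-coder
print(route("What's on my calendar tomorrow?"))  # small-chat
```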
The linked repo aims to help you achieve these things, as do I with the role-model router and protocol that I linked in another comment.
There are some cache-busting considerations, but they're solvable.
Today, any kind of routing requires putting an HTTP proxy in the middle.
Ideally, harnesses would support a routing plugin that receives the whole new context and returns just where to send it, and the harness does the sending. No HTTP proxy. Obviously there are some complications if you want to route from Codex to Anthropic or OpenRouter.
But we need to decouple context building and the routing decision from actually sending the HTTP requests; we need to be able to insert "context/routing plugins" into the chain.
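A sketch of what such a plugin chain could look like inside a harness, assuming hypothetical `ContextPlugin`/`RoutingPlugin` interfaces: plugins shape the context and choose a `Destination`, and only then does the harness make the HTTP call itself, so no proxy is needed. None of these names come from an existing harness.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Destination:
    provider: str  # e.g. "local", "anthropic", "openrouter"
    model: str


@dataclass
class Context:
    messages: list[dict[str, str]]


class ContextPlugin(Protocol):
    def transform(self, ctx: Context) -> Context: ...


class RoutingPlugin(Protocol):
    def route(self, ctx: Context) -> Optional[Destination]: ...


def plan_request(ctx: Context,
                 context_plugins: list[ContextPlugin],
                 routing_plugins: list[RoutingPlugin],
                 default: Destination) -> tuple[Context, Destination]:
    """Build the final context and pick a destination; the harness does the sending."""
    for plugin in context_plugins:
        ctx = plugin.transform(ctx)
    for plugin in routing_plugins:
        dest = plugin.route(ctx)
        if dest is not None:  # the first plugin with an opinion wins
            return ctx, dest
    return ctx, default


# Two toy plugins, both hypothetical.
class KeepLastN:
    def __init__(self, n: int) -> None:
        self.n = n

    def transform(self, ctx: Context) -> Context:
        return Context(ctx.messages[-self.n:])


class ShortChatsStayLocal:
    def route(self, ctx: Context) -> Optional[Destination]:
        if len(ctx.messages) <= 2:
            return Destination("local", "small-chat")
        return None  # no opinion; let the next plugin (or the default) decide
```

Because `plan_request` returns data rather than performing a request, the harness keeps ownership of auth and provider-specific request formats, which is where the Codex/Anthropic/OpenRouter complications would live.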
Then, a bit like OpenRouter, it uses a fast model as a classifier to choose which model should process the turn.
In my case I usually don’t route local vs. remote… although it can. Right now I use it to pick thinking vs. no-think on my preferred local model, which is a huge time saver even with the added classification step.
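The classifier step can be sketched as one cheap call before the main one. `fast_model` is any callable that sends a prompt to the small model and returns its text; the prompt wording, the THINK/FAST labels, and the `thinking` flag are assumptions (how reasoning is toggled varies by model and server), and `fake_fast_model` is a stub so the sketch runs.

```python
from typing import Callable

CLASSIFIER_PROMPT = (
    "Answer with exactly one word, THINK or FAST. "
    "Does the following request need step-by-step reasoning?\n\n{request}"
)


def needs_thinking(request: str, fast_model: Callable[[str], str]) -> bool:
    """Ask a small, fast model to classify the turn before the main model runs."""
    verdict = fast_model(CLASSIFIER_PROMPT.format(request=request))
    # Tolerate casing, whitespace and trailing punctuation; if the answer
    # isn't clearly FAST, err on the side of thinking.
    return not verdict.strip().upper().startswith("FAST")


def build_main_request(request: str, fast_model: Callable[[str], str]) -> dict:
    # "thinking" is a placeholder flag; how you toggle reasoning depends on the server.
    return {"prompt": request, "thinking": needs_thinking(request, fast_model)}


def fake_fast_model(prompt: str) -> str:
    """Stub classifier so the sketch runs without a model server."""
    return "FAST" if "rename" in prompt.lower() else "THINK"


print(build_main_request("Rename this variable", fake_fast_model))
# {'prompt': 'Rename this variable', 'thinking': False}
```

Defaulting to thinking on an unclear verdict means a confused classifier costs time rather than answer quality.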
Does this interfere with cache hits? Could a single conversation or task span multiple roles?
Why are you building this? Does this maximize my token value by saving the hard tasks for the hard model? Does it maximize cache hits as part of its scoring? Does it help agents develop a specialist mindset? Are you anticipating users will have many local models hot, or is this also a model load/unload controller?
For this we don't just need a router, because the information to make detailed and accurate routing decisions currently doesn't exist. And there are no standards: every lab, and maybe even every inference provider, has its own way of implementing reasoning, chat templates, caching, tool use, and so on. All of these make models non-interoperable.
What we need are applications that clearly specify their requests, so they can be accurately routed to a provider, whether local or remote. And for that, they need to use a standard protocol for model requests and intent.
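A sketch of what "clearly specified" requests might look like, assuming a hypothetical schema where the app declares intent, modalities, and tool needs, and providers advertise what they support. The field names and intent strings are invented for illustration; no such standard exists today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRequest:
    """Hypothetical standard request: the app states what it needs, not which model."""

    intent: str                                  # e.g. "chat", "code-review", "transcribe"
    modalities: frozenset[str] = frozenset({"text"})
    needs_tools: bool = False
    prefer_local: bool = True


@dataclass(frozen=True)
class Provider:
    name: str
    local: bool
    intents: frozenset[str]
    modalities: frozenset[str]
    supports_tools: bool


def select_provider(req: ModelRequest, providers: list[Provider]) -> Optional[Provider]:
    def capable(p: Provider) -> bool:
        return (
            req.intent in p.intents
            and req.modalities <= p.modalities          # provider covers every modality
            and (p.supports_tools or not req.needs_tools)
        )

    candidates = [p for p in providers if capable(p)]
    if not candidates:
        return None
    # Stable sort: providers matching the local/remote preference come first.
    candidates.sort(key=lambda p: p.local != req.prefer_local)
    return candidates[0]
```

Capability matching comes first and preference second, so a request the local model can't serve still gets routed remotely instead of failing.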
I wrote a longer piece here: https://news.ycombinator.com/item?id=48706181
Interesting concept, and it works in theory, but I can't see this being part of a larger system.
I'm building another router for routing between local and remote models; Show HN coming up later today. Here's a sneak preview of the GitHub repo: https://github.com/try-works/role-model
- “after the client”
- “reverse proxy” (in front of servers)
- “proxy” (in front of client)
I always have to look this up; surely there must be a standardized way to describe this?
/local fix my typo