there are tasks that inherently benefit from being centralised away, like say coordination of peers across a large area - and there are tasks that strongly benefit from being as close to the user as possible, like low latency tasks and privacy/control-centred tasks
simultaneously, there's an overlapping pull to either side caused by the monetary interests of corporations vs users - corporations want as much as possible under their control, esp. when it's monetisable information but most things are at volume, and users want to be the sole controller of products esp. when they pay for them
we had dumb terminals already being pushed in the 1960s, the "cloud", "edge computing" and all forms of consolidation vs segregation periods across the industry, it's not going to stop because there's money to be made from the inherent advantages of those models and even the industry leaders cannot prevent these advantages from getting exploited by specialist incumbents
once leaders consolidates, inevitably they seek to maximise profit and in doing so they lower the barrier for new alternatives
ultimately I think the market will never stop demanding just having your own *** computer under your control and hopefully own it, and only the removal of this option will stop this demand; while businesses will never stop trying to control your computing, and providing real advantages in exchange for that, only to enter cycles of pushing for growing profitability to the point average users keep going back and forth
The ideal world would be an edge network like Cloudflare for LLMs so a nearby POP serves your requests. I’m not sure how viable this is. On classic hardware I think it would require massive infra buildout, but maybe ASICs could be the key to making this viable.
Unfortunately, as with most of the AI providers, it's wherever they've been able to find available power and capacity. They've contracts with all of the large cloud vendors and lack of capacity is significant enough of an issue that locality isn't really part of the equation.
The only things they're particular about locality for is the infrastructure they use for training runs, where they need lots of interconnected capacity with low latency links.
Inference is wherever, whenever. You could be having your requests processed halfway around the world, or right next door, from one minute to the next.
Wow, any source for this? It would explain why they vary between feeling really responsive and really delayed.