- Consumers of LLM inference (developers and hobbyists) will become more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and incentivizing them to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)
- A larger market for on-device (and therefore open-weight) LLMs will probably concentrate more research on those inherently more efficient models (efficient because they are compute- and memory-constrained).
I think that despite the inefficiencies, shifting the market towards local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but it is still much less than what, say, a PS5 draws (roughly 200W under load).
Also remember how AWS had the same promise, and now we're just deploying stack after stack and need 'FinOps' teams to make us more resource-efficient?
They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPUs that need constant replacement, while your phone is full of mostly unused compute capacity and will barely break a sweat serving queries for a single user at a time.
They don't seem to give much thought to the energy usage per user (or what this will do to your phone battery), or to how much phone-sized and data-center-sized models differ in capability.
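The per-user energy question is easy to sketch on the back of an envelope. The numbers below are purely illustrative assumptions (not measurements from the thread): a phone SoC sustaining a local model vs. a single user's share of a batched data-center GPU.

```python
# Back-of-envelope energy-per-query comparison, on-device vs. data center.
# Every number here is an illustrative assumption, not a measurement.

def joules_per_query(power_watts: float, seconds_per_query: float) -> float:
    """Energy drawn for one query, in joules (W * s)."""
    return power_watts * seconds_per_query

# Assumption: a phone SoC pulls ~8W for ~20s to answer one query locally.
phone_j = joules_per_query(power_watts=8, seconds_per_query=20)

# Assumption: a data-center GPU pulls ~700W but batches ~32 concurrent
# queries, each finishing in ~5s, so one user's share of the power is small.
datacenter_j = joules_per_query(power_watts=700 / 32, seconds_per_query=5)

print(f"phone: {phone_j:.0f} J/query, datacenter share: {datacenter_j:.0f} J/query")
```

Under these made-up numbers the batched GPU actually comes out ahead per query, which is exactly why the per-user framing matters more than "data centers are energy monsters."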
Plus, a Mac that's not running inference idles at 1-5W, drawing more power only when it needs to. Data centers must maximize utilization; individuals and their devices don't have to.
A Mac is also the rest of the personal computer!