- Consumers of LLM inference (developers and hobbyists) will become more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and incentivizing them to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)
- A larger market for on-device (and therefore open-weight) LLMs will probably concentrate more research on those inherently more efficient models (efficient because they are compute- and memory-constrained).
I think that despite the inefficiencies, shifting the market towards local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but it is still much less than what, say, a PS5 draws (roughly 200W under load).
Also remember how AWS had the same promise, and now we're just deploying stack after stack and need 'FinOps' teams to make us more resource-efficient?
They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPUs that need constant replacement, while your phone is full of mostly unused compute capacity and will barely break a sweat serving queries for a single user at a time.
They don't seem to give much thought to the energy usage per user (or what this will do to your phone battery), or to how much phone-sized and data-center-sized models differ in capability.
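The per-user energy question is easy to sketch on the back of an envelope. The numbers below are purely illustrative assumptions (not measurements from the thread): a phone SoC sustaining a local model vs. a single user's share of a batched data-center GPU.

```python
# Back-of-envelope energy-per-query comparison, on-device vs. data center.
# Every number here is an illustrative assumption, not a measurement.

def joules_per_query(power_watts: float, seconds_per_query: float) -> float:
    """Energy drawn for one query, in joules (W * s)."""
    return power_watts * seconds_per_query

# Assumption: a phone SoC pulls ~8W for ~20s to answer one query locally.
phone_j = joules_per_query(power_watts=8, seconds_per_query=20)

# Assumption: a data-center GPU pulls ~700W but batches ~32 concurrent
# queries, each finishing in ~5s, so one user's share of the power is small.
datacenter_j = joules_per_query(power_watts=700 / 32, seconds_per_query=5)

print(f"phone: {phone_j:.0f} J/query, datacenter share: {datacenter_j:.0f} J/query")
```

Under these made-up numbers the batched GPU actually comes out ahead per query, which is exactly why the per-user framing matters more than "data centers are energy monsters."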
Plus, a Mac that's not running inference idles at 1-5W, drawing more power only when it needs to. Data centers must maximize utilization; individuals and their devices don't have to.
A Mac is also the rest of the personal computer!