by _air 6 hours ago

by Tade0 5 hours ago

The only way for hardware to reach this sort of efficiency is to embed the model in the hardware itself.

This exists[0], but the chip in question is physically large and won't fit on a phone.

[0] https://www.anuragk.com/blog/posts/Taalas.html

by tclancy 5 hours ago

I think you're ignoring the inevitable march of progress. Phones will get big enough to hold it soon.

by tren_hard 3 hours ago

Instead of slapping on an extra battery pack, it will be an onboard LLM. Could have lifecycles just like phones.

Getting bigger (foldable) phones without losing battery life, while running usable models in the same form factor, is a pretty big ask.

by RALaBarge 4 hours ago

I think the future is the model becoming lighter, not the hardware becoming heavier.

by Tade0 4 hours ago

The hardware will become heavier regardless, I'm afraid.

by ottah 5 hours ago

That's actually pretty cool, but I'd hate to freeze a model's weights into silicon without having an incredibly specific and broad use case.

by patapong 4 hours ago

Depends on cost IMO - if I could buy a Kimi K2.5 chip for a couple of hundred dollars today I would probably do it.

by whatever1 4 hours ago

I mean if it was small enough to fit in an iPhone why not? Every year you would fabricate the new chip with the best model. They do it already with the camera pipeline chips.

by superxpro12 4 hours ago

Sounds like just the sort of thing FPGAs were made for.

The $$$ would probably make my eyes bleed tho.

by chrsw 3 hours ago

Current FPGAs would have terrible performance. We need some new architecture combining ASIC LLM perf and sparse reconfiguration support maybe.

by 0x457 3 hours ago

Wouldn't it be the opposite of freezing weights?

by intrasight 5 hours ago

I think for many reasons this will become the dominant paradigm for end user devices.

Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.

Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

by bigyabai 5 hours ago

One big bottleneck is SRAM cost. Even an 8b model would probably end up being hundreds of dollars to run locally on that kind of hardware. Especially unpalatable if the model quality keeps advancing year-by-year.

> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.

by intrasight 5 hours ago

> bottleneck is SRAM cost

Not for this approach

by originalvichy 5 hours ago

On smartphones? It's not worth running a model this size on a device like this. A smaller model fine-tuned for specific use cases is not only faster but possibly more accurate. All those gigs of unnecessary knowledge are useless for the tasks usually done on smartphones.

by root_axis 3 hours ago

It will never be possible on a smartphone. I know that sounds cynical, but there's basically no path to making this possible from an engineering perspective.

by NetMageSCW 26 minutes ago

No one needs more than 640K!

by svachalek 4 hours ago

A long time. But check out Apollo from Liquid AI: the LFM2 models run pretty fast on a phone and are surprisingly capable. Not as a knowledge database, but to help process search results, solve math problems, stuff like that.

by ottah 5 hours ago

Probably 15 to 20 years, if ever. This phone is only running this model in the technical sense of running, not in a practical sense. Ignore the 0.4 tok/s, that's nothing. What really makes this example bullshit is the fact that there is no way the phone has enough RAM to hold any reasonable amount of context for that model. Context requirements are not insignificant, and as the context grows, the output will get even slower.
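To put a rough number on the context point, here is a minimal sketch of the usual key/value cache estimate for a transformer that caches keys and values per layer. The model shape below (60 layers, 8 KV heads, head dimension 128, 16-bit cache) is a made-up example, not the model from the demo, and architectures that compress the cache will need less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size for plain multi-head or grouped-query attention.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, context_tokens=32768)
print(f"{size / 2**30:.1f} GiB")  # prints "7.5 GiB"
```

The cache grows linearly with context length, so doubling the context doubles this figure, on top of whatever the weights already occupy.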

Realistically you need 300+ GB/s of fast-access memory bandwidth to the accelerator, with enough memory to fully hold quants of more than 4 bits. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
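A back-of-the-envelope sketch of that arithmetic, assuming decoding is limited purely by memory bandwidth (every byte of weights used for a token is read once per token). The 300 GB/s and 380 GB figures come from the comment above; the 5% active fraction for a mixture-of-experts model is a hypothetical value for illustration.

```python
def weight_bytes(n_params, bits_per_weight):
    """Memory needed to hold the weights at a given quantization width."""
    return n_params * bits_per_weight / 8


def max_decode_tokens_per_s(bandwidth_bytes_per_s, bytes_read_per_token):
    """Upper bound on decode speed when memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_read_per_token


bandwidth = 300e9  # 300 GB/s, from the comment
resident = 380e9   # 380 GB of weights, from the comment

# If every weight is read for every token (dense model):
dense_rate = max_decode_tokens_per_s(bandwidth, resident)

# If only a fraction of the weights is touched per token (hypothetical 5%):
sparse_rate = max_decode_tokens_per_s(bandwidth, resident * 0.05)

print(f"dense: {dense_rate:.2f} tok/s, 5% active: {sparse_rate:.1f} tok/s")
```

Real throughput will be lower than these bounds, since compute, cache reads, and scheduling overhead are ignored here.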

The only hope for running a practical, capable AI model on a handheld is both an algorithmic breakthrough that does way more with less and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up to that task, and I doubt anyone's really going to want to build silicon for it.

by alwillis 3 hours ago

> Realistically you need +300GB/s fast access memory to the accelerator, with enough memory to fully hold at least greater than 4bit quants.

The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.

The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.

by iooi 4 hours ago

Is 100 t/s the standard for models?