Generally, you use an ASIC to perform a specific task. In this case, I think the takeaway is that the LLM functionality here is performance-sensitive and has enough utility as-is to justify an ASIC.
The idea of AI as static weights is already challenged by the frequent model updates we see - and it may even become a relic once we find a new architecture.
And then it'll increasingly make sense to build such a chip into laptops, smartphones, wearables. Not for high-end tasks, but to drive the everyday bread-and-butter tasks.
FPGAs don’t scale; if they did, GPUs would have been replaced by FPGAs for graphics a long time ago.
You use an FPGA when spinning a custom ASIC doesn’t make financial sense and a generic processor such as a CPU or GPU is overkill.
Arguably the middle ground here is TPUs, which take the most efficient parts of a “GPU” for these workloads but still rely on memory access at every step of the computation.
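You can make the memory-access point concrete with a quick arithmetic-intensity estimate (a rough sketch, the numbers are illustrative, not benchmarks): single-token decode multiplies an activation vector by an N×N weight matrix, doing about 2N² FLOPs while streaming about 2N² bytes of fp16 weights, so roughly one FLOP per byte moved.

```python
# Rough arithmetic-intensity sketch for single-token decode
# (illustrative back-of-envelope math, not a benchmark).
def matvec_intensity(n, bytes_per_weight=2):  # 2 bytes per fp16 weight
    flops = 2 * n * n                    # one multiply + one add per weight
    bytes_moved = n * n * bytes_per_weight  # every weight streamed from memory
    return flops / bytes_moved

print(matvec_intensity(4096))  # 1.0 FLOP/byte: far below what the ALUs can
                               # sustain, so decode is memory-bandwidth-bound
```

That ~1 FLOP/byte is why keeping weights resident next to the compute (or baked into it) matters so much for this workload.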
I think burning the weights into the gates is kinda new.
("Weights to gates." "Weighted gates"? "Gated weights"?)
It’s also not that different from how TPUs work, where the PEs (processing elements) have special registers for weights.
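A toy sketch of that weight-stationary idea (pure Python, class and function names are mine): each PE holds its weight in a local register, loaded once, and only activations and partial sums move on each step. Fixing the weight in silicon just takes this one step further.

```python
# Toy weight-stationary PE: the weight is loaded once and stays resident,
# so per-step traffic is only the activation in and the partial sum through.
class PE:
    def __init__(self, weight):
        self.w = weight                 # preloaded / "burned in", never re-fetched

    def step(self, activation, partial_sum):
        return partial_sum + self.w * activation

# A chain of PEs computes one dot product as activations stream past.
def dot(weights, activations):
    pes = [PE(w) for w in weights]      # one-time weight load
    acc = 0.0
    for pe, a in zip(pes, activations):
        acc = pe.step(a, acc)
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```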
Also, offline access is still a necessity for many use cases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
To your point, it's neat tech, but the limitations are obvious, since 'printing' only one LLM ensures further concentration of power. In other words, history repeats itself.
I don't expect it's commercially viable today, but things definitely need to trend toward radically more efficient AI solutions.
If you baked one of these into a smart speaker that could call tools to control lights and play music, it would still be able to do that when Llama 4 or 5 or 6 comes out.
[1] although security might be a big enough reason for upgrades to still be required
In the real world, there are talking refrigerators that don't need to know how to recite Shakespeare.
I think the interesting point is the transition time. When is it ROI-positive to tape out a chip for your new model? There’s a bunch of fun infra to build to make this process cheaper/faster and I imagine MoE will bring some challenges.
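One way to frame the ROI question is a simple break-even: tape-out NRE divided by the per-unit savings over a commodity chip. Every number below is a made-up placeholder, just to show the shape of the calculation.

```python
# Back-of-envelope break-even for taping out a model-specific ASIC.
# All dollar figures are hypothetical placeholders.
def breakeven_units(nre_cost, gpu_unit_cost, asic_unit_cost):
    # Units you must ship before per-unit savings pay back the tape-out NRE.
    return nre_cost / (gpu_unit_cost - asic_unit_cost)

units = breakeven_units(nre_cost=25_000_000,   # mask set + design (hypothetical)
                        gpu_unit_cost=2_000,
                        asic_unit_cost=500)
print(int(units))  # 16666: if the model goes stale before you ship that many,
                   # the tape-out wasn't ROI-positive
```

The interesting part is that both sides of that fraction move: cheaper tape-out infra shrinks the numerator, while faster model churn shrinks the window you have to reach the break-even count.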
We transitioned from software on CPUs to fixed GPU hardware... But then we transitioned back to software running on GPUs! So there's no way you can say "of course this is the future".