> IDK how the custom hardware exploits this; would love to hear any ideas!

You might like this article [1], titled "FPGA-based CNN Acceleration using Pattern-Aware Pruning". More context and details can be found in the PhD thesis of Léo Pradels [2].

[1]: https://inria.hal.science/hal-04689673/document

[2]: https://theses.hal.science/tel-05021575v1/file/PRADELS_Leo.p...
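
For readers who have not met the idea, here is a rough sketch of what pattern-based pruning generally means: each 3x3 kernel keeps only the weights at positions allowed by one mask drawn from a small, fixed library, so hardware only has to support a handful of sparsity shapes. This is not taken from the linked article or thesis. The `PATTERNS` library, the 4-weights-per-kernel budget, and the `prune_kernel` helper are all assumptions made for illustration.

```python
# Hypothetical pattern library: each pattern lists the 4 positions (row-major,
# 0..8) of a flattened 3x3 kernel that are allowed to stay non-zero.
PATTERNS = (
    (1, 3, 4, 5),
    (1, 4, 5, 7),
    (3, 4, 5, 7),
    (1, 3, 4, 7),
)


def prune_kernel(kernel):
    """Return (pattern_id, pruned_kernel) for a flattened 3x3 kernel.

    The pattern that preserves the largest total weight magnitude is chosen,
    and every weight outside that pattern is set to zero.
    """
    if len(kernel) != 9:
        raise ValueError("expected a flattened 3x3 kernel with 9 weights")

    def kept_magnitude(pattern):
        return sum(abs(kernel[i]) for i in pattern)

    pattern_id = max(range(len(PATTERNS)), key=lambda p: kept_magnitude(PATTERNS[p]))
    keep = set(PATTERNS[pattern_id])
    pruned = [w if i in keep else 0.0 for i, w in enumerate(kernel)]
    return pattern_id, pruned
```

Because every kernel ends up matching one of only four masks, an accelerator could in principle store a 2-bit pattern id per kernel plus the four surviving weights, instead of nine weights or a general sparse index.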

by fulafel3 hours ago|

Current accelerators (TPUs, various on-chip NPUs) are something close to this. "Systolic array" is the established computer architecture term for flowing data from computation to computation without the overhead of a register file or the von Neumann bottleneck.
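
To make the dataflow concrete, here is a minimal cycle-by-cycle simulation of an output-stationary systolic array doing a matrix multiply. It is a sketch under assumptions, not a model of any particular TPU or NPU: the function name `systolic_matmul`, the output-stationary layout, and the input skewing scheme are all choices made for illustration.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    a is n x k, b is k x m (lists of lists). Processing element (i, j) owns
    one accumulator. A values enter at the left edge and move right; B values
    enter at the top edge and move down. Row i and column j are delayed by
    i and j cycles so matching operands meet at the right PE.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # A value currently held by each PE
    b_reg = [[0] * m for _ in range(n)]  # B value currently held by each PE

    for t in range(n + m + k - 2):       # cycles until the last operands arrive
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:               # left edge: feed skewed row of A
                    s = t - i
                    new_a[i][j] = a[i][s] if 0 <= s < k else 0
                else:                    # interior: take value from left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:               # top edge: feed skewed column of B
                    s = t - j
                    new_b[i][j] = b[s][j] if 0 <= s < k else 0
                else:                    # interior: take value from upper neighbour
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):           # every PE does one multiply-accumulate
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

Note that operands are only ever read from a neighbouring PE or from the array edge. There is no shared register file or memory fetch inside the loop, which is the point of the architecture.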

by cm218719 hours ago|

Random thought. Once models stabilise, could you possibly hardcode the model in gates? Or are they too large for a single chip?

by 8note18 hours ago|

https://www.anuragk.com/blog/posts/Taalas.html

by lsaferite14 hours ago|

https://taalas.com/

by jwHollister13 hours ago|

Wow, if they can get something like this working, what happens to all this infrastructure? Hyperscalers must be assuming the wrong lifespan for that stuff, considering the next gen will be 1000x more efficient.

by otterley13 hours ago|

The question isn’t whether it works (it does); the question is whether there are buyers for hardware that is obsolete the day it ships. Models evolve much more quickly than hardware can keep up.

by simondotau12 hours ago|

Presumably at some point the rapid progress of models will plateau, at least insofar as a model could be frozen in time and remain economically useful for the expected life of hardware. Especially if it comes with compelling benefits e.g. dramatically lower latency and/or dramatically higher performance per watt.

If you can build chips that could run one specific LLM 100x faster than anything else, it would have a use case that nothing else could match.

by lsaferite5 hours ago|

Those Taalas chips apparently run at 1/10 the power of current SOTA GPU setups. If they can execute even partially on their plan, it'll be a literal game changer.

by fragmede11 hours ago|

https://www.cerebras.ai/ is exactly that! Holy shit it's fast.

by otterley6 hours ago|

Cerebras is not that. Cerebras isn’t tied to a particular model like Taalas is. The latter is even faster than Cerebras.

by wrsh076 hours ago|

Right, but there exist problems that need to be routinely solved and can be solved on GLM 5.2. Is the model state of the art when it is published? No. But when it comes out you could optimize it and let your solver run forever for quite cheap, and that could be useful if the only problems you want it to solve (for cheap) are solvable by that model.

And the high water mark of what can be solved by open models will keep going up.

by indigo94510 hours ago|

One obvious use case is edge computing, such as in industrial applications that cannot tolerate the risk of a network link or cloud service going down. Even embedded use cases are possible, such as an image classifier model in a security camera.

by cm21879 hours ago|

In fact any application where the task is stable and the model good enough to address that task. As you suggest, industrial applications where a robot must deal with variants of the same repetitive task. Or a military drone which needs to be jamming proof.

by Someone8 hours ago|

> Or a military drone which needs to be jamming proof.

That, if used in war, I would think, would need the ability to be updated frequently. For example, your enemy might find out (say by running tests on hardware they captured from you) that painting some red paint in a particular shape (a smiley might even work) on their hardware prevented your drones from attacking them because it confuses that pattern with the Red Cross logo.

by 5423542342352 hours ago|

Those are really two different things. One is the computer vision that could be “hard coded” and the other is the image library, which would be updated regularly. Look at facial recognition. You can download and run a facial recognition model on your GPU that looks at a library of your personal photos. The model doesn’t change when it scans your photos for faces; it just writes the data associated with a “face” to whatever library. When you add a new picture, it computes that face data and compares it to the library for a match. The actual model never needs to change. It is the same as the one I downloaded and ran on my GPU for my photos. If it was written on chips we both bought and installed, it would work the same way.[1]

[1] Yes, this is a massive simplification
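
The split described above, a fixed model plus a library that changes, can be sketched in a few lines. Everything here is a toy stand-in: `FROZEN_WEIGHTS`, `embed`, the `Gallery` class, and the 0.95 cosine threshold are assumptions for illustration, and a real face recognizer would use a deep network rather than a 2x3 linear map.

```python
import math

# Stand-in for weights that never change after deployment (e.g. burned into a chip).
FROZEN_WEIGHTS = ((0.5, -0.2, 0.1), (0.3, 0.8, -0.5))


def embed(pixels):
    """Frozen 'model': a fixed linear map from 3 inputs to a 2-D embedding."""
    return tuple(sum(w * x for w, x in zip(row, pixels)) for row in FROZEN_WEIGHTS)


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class Gallery:
    """Updatable library of known embeddings; the model itself is never touched."""

    def __init__(self):
        self.entries = {}

    def enroll(self, name, pixels):
        self.entries[name] = embed(pixels)

    def identify(self, pixels, threshold=0.95):
        query = embed(pixels)
        best_name, best_score = None, threshold
        for name, known in self.entries.items():
            score = cosine(query, known)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name
```

Enrolling a new face only adds a row to `Gallery.entries`. The weights used by `embed` are identical before and after, which is the property that would let them live in fixed hardware.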

by TeMPOraL4 hours ago|

You keep the "reasoning core" burned and play the cat-and-mouse game at the I/O edge. Enemy invents a smiley shield, your R&D figures out some filtering step that defeats this effect without compromising general image recognition. Then the enemy figures out a new trick, your R&D invents a countermeasure, and so on - point is, this can happen for a long time in layers on top of the core model. If the enemy invents some robust way to attack the core that cannot be filtered out, it's game over for that hardware, but that is a much more difficult task and might take longer than the expected service time of a given batch of drones.

by SoftTalker2 hours ago|

Sort of mirrors how biological organisms work. E.g. in a bird, the core functionality of knowing how to fly is burned in. Hunting food is probably a combination of experiential learning on top of instinctive behavior, and is somewhat adaptable to local conditions.

by largbae5 hours ago|

There may be all sorts of stable use case models that this could be interesting for. Imagine permanent voice translation circuits at a tiny fraction of the current price, glasses that subtitle the world with long battery life.

by lsaferite6 hours ago|

They are betting on fast release cycles coupled with much lower costs (purchase and operations) mixed with the ability to have dynamic fine tunes on top of the static model.
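
One way "fine tunes on top of a static model" could work in principle is a small low-rank adapter added to the output of a frozen layer, in the style of LoRA. The sketch below is only that general idea. Whether Taalas does anything like this is not stated in this thread, and `W_FROZEN`, `forward`, and the adapter shapes are hypothetical.

```python
# Frozen base weights: imagine these fixed in hardware (d_out x d_in).
W_FROZEN = ((1.0, 0.0), (0.0, 1.0))


def matvec(matrix, x):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def forward(x, adapter=None):
    """Compute W_FROZEN @ x, plus an optional low-rank update B @ (A @ x).

    adapter is a pair (A, B) with A of shape r x d_in and B of shape d_out x r.
    Only A and B would need to live in rewritable memory.
    """
    y = matvec(W_FROZEN, x)
    if adapter is not None:
        a, b = adapter
        delta = matvec(b, matvec(a, x))  # small correction from the adapter
        y = [yi + di for yi, di in zip(y, delta)]
    return y
```

With rank r much smaller than the layer width, the adapter holds far fewer parameters than the base layer, so swapping it would be cheap compared with respinning the chip.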

by SwellJoe9 hours ago|

The models have to run on something or they're useless. They can't run on future hardware today, and people want to use models today. So, if hardware is obsolete the day it ships, we're all using obsolete hardware, and there's no alternative to that.

by otterley6 hours ago|

Taalas encodes the model into the hardware itself. The two are inextricably coupled. It’s like buying a CNC router that can’t be reprogrammed to build anything other than a specific predetermined kitchen cabinet. And the model used inside is frozen many months before the hardware ships, since the process from tapeout to production takes that long.

In contrast, tomorrow’s models will typically run, although perhaps more slowly, on general-purpose inference hardware that was released today or even years ago.