This is something I've been thinking about for a while... the current state of things really does feel like the dial-up era, and it makes you wonder what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a JPEG load a few rows of pixels at a time, or of the various loading and connecting animations that applications implemented before things got fast enough to make them irrelevant.

Some of the work in that direction, like what Cerebras or Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder what even current state-of-the-art models could do if they were available at, say, a million tokens per second at very low cost.

reply
Take a look at https://chatjimmy.ai/ -- it's running against Taalas' "hardcore" silicon model, i.e. a dedicated, ASIC-like chip.
reply
Wow - actually pretty astonishing how fast their inference is. So fast it feels fake?
reply
Yeah, when you find fast inference like that it almost feels like the answer arrives before you hit return. Now imagine it running locally with no server round-trip.
reply
Groq was the preview of the broadband era of LLMs for me. I remember asking a question on the demo site and the answer text showed up near instantly. Far faster than I could read. This was ~1 year ago and pre-acquisition.
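"Far faster than I could read" is easy to quantify with some back-of-envelope arithmetic. The figures below are assumptions, not from the thread: ~250 words/min reading speed, ~0.75 words per LLM token, and a hypothetical 500 tokens/s for fast inference.

```python
# Back-of-envelope: model token rate vs. human reading speed.
# All three constants are assumed ballpark figures, not measured values.
READ_WPM = 250          # typical adult reading speed, words per minute
WORDS_PER_TOKEN = 0.75  # rough words-per-token ratio for English text
TOKENS_PER_SEC = 500    # hypothetical "fast inference" token rate

read_tokens_per_sec = READ_WPM / 60 / WORDS_PER_TOKEN  # ~5.6 tok/s
speedup = TOKENS_PER_SEC / read_tokens_per_sec

print(f"Reader: ~{read_tokens_per_sec:.1f} tok/s; model is ~{speedup:.0f}x faster")
```

So even at a few hundred tokens per second, output already lands roughly two orders of magnitude faster than anyone can read it.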
reply
You're right about it being reminiscent of the dial-up era, but I don't believe it's 300 or 1200 baud territory; it's more like 4800:

Modem vs Claude according to Claude:

300 @ 2368 characters - 1m 19s

1200 @ 2368 characters - 19.7s

2400 @ 2368 characters - 9.9s

14.4K @ 2368 characters - 1.6s

33.6K @ 2368 characters - 705 ms

56K @ 2368 characters - 447 ms

Claude @ 2368 characters - 7.9s
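The arithmetic behind those numbers is straightforward, assuming the classic ~10 bits on the wire per character (8 data bits plus start/stop framing). A quick sketch:

```python
# Transfer-time arithmetic for the modem comparison above.
# Assumes ~10 bits per character on the wire (8 data + start/stop bits).
BITS_PER_CHAR = 10
CHARS = 2368

def transfer_seconds(baud: int, chars: int = CHARS) -> float:
    """Seconds to send `chars` characters at a given baud rate."""
    return chars * BITS_PER_CHAR / baud

for baud in (300, 1200, 2400, 14_400, 33_600, 56_000):
    print(f"{baud:>6} baud: {transfer_seconds(baud):6.2f} s")

# 2368 characters in 7.9 s works out to an effective rate of:
effective_baud = CHARS * BITS_PER_CHAR / 7.9
print(f"Claude: ~{effective_baud:.0f} baud equivalent")
```

That puts the effective rate around 3000 baud, i.e. between a 2400 and a 4800 modem.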

reply
Check chatjimmy.ai
reply
https://chatjimmy.ai is a demo of the "burn the model into an ASIC" approach sold by Taalas[0], which they use to run Llama 3.1 8B at ~17,000 tokens per second.

[0] - https://taalas.com/products/

reply
Not to downplay their accomplishment, but Llama 3.1 8B is a terrible model; it's really outdated at this point. It's cool that they were able to accelerate a model with silicon, but it also feels wasteful, since Llama 8B is such a useless model.
reply
I guess their point was to demonstrate that it's possible to bake a decently sized model into silicon. As with anything hardware-related, the lead time will be considerably longer than for the software counterparts, so in a 1-2 year timeframe we might see something like Gemma 4 baked onto silicon.
reply
Yeah, I think the important part is the process to convert the model to silicon, not the actual implementation itself.

Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.

reply
I agree, Gemma 3 12B is a very good model for its size and it was only obsoleted by Gemma 4.

Heck, I'm still a fan of Gemma 2 9B.

reply
There was a startup posted here which built custom hardware that let the AI respond instantly. Thousands of tokens per second.
reply
Taalas. A sibling comment of yours posted the chat demo URL -

https://chatjimmy.ai/

reply
Woah. How is this working? It's stupid fast.
reply
The weights are mapped directly to transistors. It's not a generic processor; it's literally a dedicated Llama 8B chip that can't be used for anything else. The more specialized the hardware, the faster it gets, and Taalas is pushing that to the limit.
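A loose software analogy (not Taalas' actual design) for why fixing the weights buys speed: when weights are frozen at "build time", every multiply can be specialized -- zeros vanish, powers of two become shifts -- much as a hardwired chip dedicates transistors to one specific set of weights. The 2x2 matrix below is purely hypothetical:

```python
# Hypothetical 2x2 weight matrix, frozen at "build time":
W = ((2, 0),
     (1, 4))

def generic_matvec(w, x):
    """Generic path: weights are data, fetched and multiplied at runtime."""
    return [sum(w[i][j] * x[j] for j in range(len(x))) for i in range(len(w))]

def baked_matvec(x):
    """'Baked' path: the weights themselves are gone -- only their
    effect remains, specialized into shifts and adds."""
    return [x[0] << 1,             # 2*x0 + 0*x1 (zero weight eliminated)
            x[0] + (x[1] << 2)]    # 1*x0 + 4*x1 (power of two -> shift)

assert generic_matvec(W, [3, 5]) == baked_matvec([3, 5])
```

Same output, but the baked path does no weight fetches at all -- which is roughly what "weights mapped to transistors" buys you, taken to the extreme.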

They seem to be doing well. I checked recently and their API is closed to signups due to overwhelming demand.

reply
Cerebras.

They built an entire-wafer chip: the whole wafer is one huge active die. It takes a lot of clever engineering (and cooling) to make it work, and it's very cool.

reply
Groq.
reply
No, it was a custom ASIC with the weights baked in for a single model. I do envision a future where we return to cartridges: local AI is the default, and massively optimised chips are built to be plug-and-play, each running a single SotA model.
reply