> any factor of 10 being a new science / new product category,

I often remind people that two orders of magnitude of quantitative change is a qualitative change.

> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.

While I also see lots of these systems running standalone, I think they'll really shine when combined with more flexible inference engines: the chip runs the unchanging parts of the model while the coupled inference engine handles whatever is too new to have been baked into silicon.

I'm concerned about the environmental impact. Chip manufacture is not very clean, and these chips will need to be swapped out and replaced at a higher cadence than we currently see with GPUs.

reply
Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.
reply
Yeah, and a team I'm not familiar with. I didn't check the bios, but they don't lead with "our team made this or that GPU for this or that bigco".

The design IP at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at TSMC. Or they've been waiting a year for a slot :)

reply
From the article:

"Ljubisa Bajic desiged video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as s senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent."

His wife (COO) worked at Altera, ATI, AMD and Tenstorrent.

"Drago Ignjatovic, who was a senior design engineer working on AMD APUs and GPUs and took over for Ljubisa Bajic as director of ASIC design when the latter left to start Tenstorrent. Nine months later, Ignjatovic joined Tenstorrent as its vice president of hardware engineering, and he started Taalas with the Bajices as the startup’s chief technology officer."

Not a youngster gang...

reply
There might be a food chain of lower-order uses when they become "obsolete".
reply
I think there will be a lot of space for sensory models in robotics, as the laws of physics don't change much, and a light switch or automobile controls have remained stable and consistent over the last few decades.
reply
I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.
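
Just to make "routing" concrete, here's a minimal sketch of the idea. The model names, the classify heuristic, and the endpoint details are placeholders/assumptions (an OpenRouter-style, OpenAI-compatible API), not anything these providers actually ship as a router:

    import requests

    # Hypothetical routing table: task label -> model id (names are illustrative).
    ROUTES = {
        "code": "some-coding-model",
        "creative": "some-creative-model",
        "default": "some-generalist-model",
    }

    def classify(prompt: str) -> str:
        # Stand-in heuristic; a real router would use a cheap classifier model.
        if "```" in prompt or "def " in prompt or "stack trace" in prompt.lower():
            return "code"
        if prompt.lower().startswith(("write a story", "roleplay")):
            return "creative"
        return "default"

    def route_and_run(prompt: str, api_key: str) -> str:
        model = ROUTES[classify(prompt)]
        # Assuming an OpenRouter-style chat completions endpoint; adjust as needed.
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
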
reply
Agree that routing is becoming the critical layer here. vLLM's Iris looks really promising for this: https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

There's already some good work on router benchmarking, which is pretty interesting.

reply
At 16k tokens/s, why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.

Abundance supports different strategies. One approach: set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. Of the models that did respond in time, you know a priori which has the highest quality in aggregate, so pick that one.
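
A toy sketch of that deadline strategy, just to make it concrete (call_model and the ranking are placeholders, not any particular API):

    import asyncio

    # Models ranked by aggregate quality, best first (placeholder names).
    RANKED_MODELS = ["model-a", "model-b", "model-c"]

    async def call_model(model: str, turn: str) -> str:
        # Placeholder for an actual async API call to the given model.
        ...

    async def answer_with_deadline(turn: str, deadline_s: float = 2.0):
        tasks = {m: asyncio.create_task(call_model(m, turn)) for m in RANKED_MODELS}
        done, pending = await asyncio.wait(tasks.values(), timeout=deadline_s)
        for t in pending:
            t.cancel()  # deadline hit: cancel anything still running
        # Of the models that finished in time, take the one ranked highest a priori.
        for model in RANKED_MODELS:
            task = tasks[model]
            if task in done and not task.exception():
                return task.result()
        return None  # nothing beat the deadline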

reply
The best coding model won't be the best roleplay one, which won't be the best at tool use. Which model is best depends on what you want to do.
reply
I'm not saying you're wrong, but why is this the case?

I'm out of the loop on training LLMs, but to me it's just pure data input. Are they choosing to include more code rather than, say, fiction books?

reply
There is pre-training, where the model passively reads stuff from the web.

From there you go to RL training, where humans grade model responses, or the AI writes code to try to pass tests and learns how to get the tests to pass, etc. The RL phase is pretty important because it's not passive, and it can focus on the weaker areas of the model too, so you can actually train on a larger dataset than the sum of recorded human knowledge.
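
To make that second phase concrete, a very rough sketch of a code-RL reward signal, with the model-side pieces (generate, update) left as stubs since the real training loop is far more involved:

    def reward_from_tests(candidate_code: str, tests: list[str]) -> float:
        """Fraction of unit tests that pass against the candidate code.
        Toy 'sandbox' only -- never exec untrusted code like this for real."""
        passed = 0
        for test_src in tests:
            try:
                namespace: dict = {}
                exec(candidate_code, namespace)  # load the model's solution
                exec(test_src, namespace)        # test raises on failure
                passed += 1
            except Exception:
                pass
        return passed / len(tests) if tests else 0.0

    def rl_step(model, problem: str, tests: list[str]) -> None:
        candidate = model.generate(problem)           # model writes code (stub)
        reward = reward_from_tests(candidate, tests)  # grade it by running tests
        model.update(problem, candidate, reward)      # nudge the policy (stub)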

reply
I'll go ahead and say they're wrong (source: building and maintaining an LLM client with llama.cpp integrated & 40+ third-party models via HTTP).

I desperately want there to be differentiation. Reality has shown over and over again that it doesn't matter. Even if you run the same query across X models and then do some form of consensus, the improvements on benchmarks are marginal and the UX is worse (more time, more expensive, and the final answer is muddied and bounded by the quality of the best model).

reply
I came across this yesterday. Haven't tried it, but it looks interesting:

https://agent-relay.com/

reply
For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?
reply
My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller model's representation of the larger model's weights in order to attempt to accurately predict its token output. This wouldn't work cross-model, as the token probabilities are completely different.
reply
This is not correct.

Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

It's a balance: you want it to be guessing correctly as much as possible, but also to be as fast as possible. Validation takes time, and every guess needs to be validated, etc.

The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.

reply
Smaller quant or smaller model?

Afaik it can work with anything, but sharing a vocab solves a lot of headaches, and the better the token probs match, the more efficient it gets.

Which is why it is usually done with same-family models and most often NOT just different quantizations of the same model.

reply
I think they’d commission a quant directly. Benefits go down a lot when you leave model families.
reply
The guts of an LLM aren't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway, and then what's the point? I assume I'm missing some detail here, but what?

reply
Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence from scratch, because validation can take more advantage of parallel processing. So the process is: generate with the small model -> validate with the big model -> fall back to generating with the big model only where validation fails.
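
A simplified greedy version of that loop, assuming toy draft_model / big_model objects (argmax_next and argmax_at are made-up method names standing in for a real implementation):

    def speculative_decode(big_model, draft_model, prompt_ids, n_new, k=4):
        ids = list(prompt_ids)
        target_len = len(prompt_ids) + n_new
        while len(ids) < target_len:
            # 1. Draft: the small model cheaply guesses the next k tokens.
            draft = []
            for _ in range(k):
                draft.append(draft_model.argmax_next(ids + draft))
            # 2. Verify: ONE big-model forward pass scores all k drafted positions,
            #    returning what the big model itself would have emitted at each one.
            verified = big_model.argmax_at(ids, draft)
            # 3. Accept the longest prefix where the draft matches the big model.
            n_ok = 0
            while n_ok < k and draft[n_ok] == verified[n_ok]:
                n_ok += 1
            ids.extend(draft[:n_ok])
            # 4. On the first mismatch, keep the big model's own token instead, so
            #    every verification pass still makes progress.
            if n_ok < k and len(ids) < target_len:
                ids.append(verified[n_ok])
        return ids[:target_len]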

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...

reply
See also speculative cascades, which is a nice read and furthered my understanding of how it all works:

https://research.google/blog/speculative-cascades-a-hybrid-a...

reply
Verification is faster than generation: one forward pass can verify multiple tokens, versus a pass for every new token during generation.
reply
I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...
reply
They are referring to a thing called "speculative decoding" I think.
reply
When you predict with the small model, the big model can verify more of a batch at once, closer in speed to processing input tokens, as long as the predictions are good and the work doesn't have to be redone.
reply
Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time as one API call to OpenAI.
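
For example, a crude best-of-N / majority-vote setup on top of a fast endpoint; solve_once is a placeholder for one sampled attempt that returns just the final answer:

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def solve_once(question: str) -> str:
        # Placeholder: one sampled attempt against the fast inference endpoint.
        ...

    def solve_by_consensus(question: str, n: int = 100):
        # Fire off n independent attempts in parallel and majority-vote the answer.
        with ThreadPoolExecutor(max_workers=n) as pool:
            answers = list(pool.map(lambda _: solve_once(question), range(n)))
        answers = [a for a in answers if a is not None]
        return Counter(answers).most_common(1)[0][0] if answers else None
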
reply