I often remind people two orders of quantitative change is a qualitative change.
> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.
While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.
I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.
The design ip at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at tsmc. Or they’ve been waiting a year for a slot :)
"Ljubisa Bajic desiged video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as s senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent."
His wife (COO) worked at Altera, ATI, AMD and Testorrent.
"Drago Ignjatovic, who was a senior design engineer working on AMD APUs and GPUs and took over for Ljubisa Bajic as director of ASIC design when the latter left to start Tenstorrent. Nine months later, Ignjatovic joined Tenstorrent as its vice president of hardware engineering, and he started Taalas with the Bajices as the startup’s chief technology officer."
Not a youngster gang...
There's already some good work on router benchmarking which is pretty interesting
Abundance supports different strategies. One approach: Set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. You know a priori which models have the highest quality in aggregate. Pick that one.
I'm out of the loop on training LLMs, but to me it's just pure data input. Are they choosing to include more code rather than, say fiction books?
From there you go to RL training, where humans are grading model responses, or the AI is writing code to try to pass tests and learning how to get the tests to pass, etc. The RL phase is pretty important because it's not passive, and it can focus on the weaker areas of the model too, so you can actually train on a larger dataset than the sum of recorded human knowledge.
I desperately want there to be differentiation. Reality has shown over and over again it doesn’t matter. Even if you do same query across X models and then some form of consensus, the improvements on benchmarks are marginal and UX is worse (more time, more expensive, final answer is muddied and bound by the quality of the best model)
Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.
It's a balance as you want it to be guessing correctly as much as possible but also be as fast as possible. Validation takes time and every guess needs to be validated etc
The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.
Afaik it can work with anything, but sharing vocab solves a lot of headaches and the better token probs match, the more efficient it gets.
Which is why it is usually done with same family models and most often NOT just different quantizations of the same model.
> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model
suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?
More info:
* https://research.google/blog/looking-back-at-speculative-dec...
* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...
https://research.google/blog/speculative-cascades-a-hybrid-a...