This math is useful. Lots of folks are scoffing in the comments below. I have a couple of reactions after chatting with it:

1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing ~12k tokens/second on Llama 2 13B FP8. Knowing these architectures, that's almost certainly a heavily batched run (batch size 100+), meaning time to first token is almost certainly slower than Taalas's. Probably much slower, since Taalas is in the milliseconds.

3) Jensen has these Pareto curve graphs: for a given energy budget and chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math says these probably do not shift the curve. A 6nm part is likely 30-40% bigger than the equivalent 4nm part and draws that much more power; if we take the numbers they give and extrapolate to an fp8 model (slower) on a smaller geometry (roughly 30% faster and lower power), then comparing 16k tokens/second for Taalas to 12k tokens/s for an H200 puts these chips on the same ballpark curve.

However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.

Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

reply
> any factor of 10 being a new science / new product category,

I often remind people that two orders of magnitude of quantitative change is a qualitative change.

> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.

While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.

I'm concerned about the environmental impact. Chip manufacturing is not very clean, and these chips will need to be swapped out and replaced at a higher cadence than we currently see with GPUs.

reply
Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.
reply
Yeah, and a team I'm not familiar with. I didn't check bios, but they don't lead with 'our team made this or that GPU for this or that bigco'.

The design IP at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at TSMC. Or they've been waiting a year for a slot :)

reply
From the article:

"Ljubisa Bajic desiged video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as s senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent."

His wife (COO) worked at Altera, ATI, AMD and Tenstorrent.

"Drago Ignjatovic, who was a senior design engineer working on AMD APUs and GPUs and took over for Ljubisa Bajic as director of ASIC design when the latter left to start Tenstorrent. Nine months later, Ignjatovic joined Tenstorrent as its vice president of hardware engineering, and he started Taalas with the Bajices as the startup’s chief technology officer."

Not a youngster gang...

reply
There might be a food chain of lower-order uses when they become "obsolete".
reply
I think there will be a lot of space for sensorial models in robotics, as the laws of physics don't change much, and a light switch or automobile controls have remained stable and consistent over the last decades.
reply
I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.
reply
Agree that routing is becoming the critical layer here. vLLM Iris looks really promising for this: https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

There's already some good work on router benchmarking which is pretty interesting

reply
At 16k tokens/s why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.

Abundance supports different strategies. One approach: set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. Of the models that did finish, you know a priori which has the highest quality in aggregate; pick that one.
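
A rough asyncio sketch of that deadline fan-out, with a stubbed-out model call and made-up model names and rankings (none of this is a real client):

  import asyncio, random

  MODEL_RANK = {"model-a": 3, "model-b": 2, "model-c": 1}  # a priori aggregate quality

  async def query_model(name: str, prompt: str) -> tuple[str, str]:
      # Stand-in for a real API call with variable latency.
      await asyncio.sleep(random.uniform(0.05, 0.5))
      return name, f"{name} answer to: {prompt}"

  async def fan_out(prompt: str, deadline_s: float = 0.25) -> str:
      tasks = [asyncio.create_task(query_model(m, prompt)) for m in MODEL_RANK]
      done, pending = await asyncio.wait(tasks, timeout=deadline_s)
      for t in pending:          # deadline hit: cancel the stragglers
          t.cancel()
      finished = [t.result() for t in done]
      if not finished:
          return "no model met the deadline"
      # Of the models that made it, keep the one ranked highest in aggregate.
      name, answer = max(finished, key=lambda r: MODEL_RANK[r[0]])
      return answer

  print(asyncio.run(fan_out("What is 2+2?")))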

reply
The best coding model won’t be the best roleplay one which won’t be the best at tool use. It depends what you want to do in order to pick the best model.
reply
I'm not saying you're wrong, but why is this the case?

I'm out of the loop on training LLMs, but to me it's just pure data input. Are they choosing to include more code rather than, say, fiction books?

reply
There is pre-training, where the model passively reads stuff from the web.

From there you go to RL training, where humans are grading model responses, or the AI is writing code to try to pass tests and learning how to get the tests to pass, etc. The RL phase is pretty important because it's not passive, and it can focus on the weaker areas of the model too, so you can actually train on a larger dataset than the sum of recorded human knowledge.
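
A toy illustration of the "RL against verifiable rewards" part, with a stub standing in for the model and a unit test as the grader (nothing here resembles a real trainer):

  import random

  def toy_model(prompt: str) -> str:
      # Stand-in for an LLM policy: proposes one of a few candidate programs.
      return random.choice([
          "def add(a, b): return a - b",   # wrong
          "def add(a, b): return a + b",   # right
      ])

  def run_tests(code: str) -> float:
      # Verifiable reward: 1.0 if the generated code passes the unit test.
      scope = {}
      try:
          exec(code, scope)
          assert scope["add"](2, 3) == 5
          return 1.0
      except Exception:
          return 0.0

  for step in range(5):
      completion = toy_model("Write add(a, b).")
      reward = run_tests(completion)
      # A real RL phase would now do a policy-gradient style update that makes
      # high-reward completions more likely; here we just print the signal.
      print(step, reward)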

reply
I'll go ahead and say they're wrong (source: building and maintaining an LLM client with llama.cpp integrated & 40+ third-party models via HTTP).

I desperately want there to be differentiation. Reality has shown over and over again that it doesn't matter. Even if you run the same query across X models and then do some form of consensus, the improvements on benchmarks are marginal and the UX is worse (more time, more expense, and the final answer is muddied and bounded by the quality of the best model).

reply
I came across this yesterday. Haven't tried it, but it looks interesting:

https://agent-relay.com/

reply
For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?
reply
My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller model's representation of the larger model's weights in order to attempt to accurately predict its token output. This wouldn't work cross-model, as the token probabilities are completely different.
reply
This is not correct.

Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

It's a balance: you want the draft model guessing correctly as often as possible, but also to be as fast as possible. Validation takes time, and every guess needs to be validated, etc.

The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.
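
To put numbers on that balance: the standard speculative decoding result is that with draft length k and per-token acceptance probability a, you expect (1 - a^(k+1)) / (1 - a) tokens out of each verification pass. A quick calculation with illustrative values (not from any benchmark):

  def expected_tokens_per_pass(a: float, k: int) -> float:
      # Expected tokens generated per big-model verification pass.
      return (1 - a ** (k + 1)) / (1 - a)

  for a in (0.5, 0.7, 0.9):
      print(a, round(expected_tokens_per_pass(a, k=4), 2))
  # 0.5 -> 1.94, 0.7 -> 2.77, 0.9 -> 4.1 tokens per pass: a better-matched
  # draft model (higher acceptance rate) buys you more of the speedup.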

reply
Smaller quant or smaller model?

Afaik it can work with anything, but sharing the vocab solves a lot of headaches, and the better the token probabilities match, the more efficient it gets.

Which is why it is usually done with same-family models and most often NOT just different quantizations of the same model.

reply
I think they’d commission a quant directly. Benefits go down a lot when you leave model families.
reply
The guts of an LLM aren't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big models, don't you have to wait for the big model anyway, and then what's the point? I assume I'm missing some detail here, but what?

reply
Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence from scratch, because validation can take more advantage of parallel processing. So the process is: generate with the small model -> validate with the big model -> generate with the big model only if validation fails.
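
A toy sketch of that loop (stand-in next-token functions, not any real inference stack; the key point is that step 2 is a single batched pass in a real system):

  import random

  VOCAB = ["the", "cat", "sat", "on", "mat"]

  def small_model_next(ctx):   # cheap draft model (stand-in)
      return random.choice(VOCAB)

  def big_model_next(ctx):     # expensive target model (stand-in)
      return VOCAB[len(ctx) % len(VOCAB)]

  def speculative_step(ctx, k=4):
      # 1) Draft k tokens cheaply with the small model.
      draft = []
      for _ in range(k):
          draft.append(small_model_next(ctx + draft))
      # 2) Verify the draft with the big model (one batched forward pass
      #    over all k positions in a real implementation).
      accepted = []
      for tok in draft:
          target = big_model_next(ctx + accepted)
          if target == tok:
              accepted.append(tok)        # big model agrees: keep the free token
          else:
              accepted.append(target)     # 3) first mismatch: take the big
              break                       #    model's token and stop
      return ctx + accepted

  ctx = ["the"]
  for _ in range(3):
      ctx = speculative_step(ctx)
  print(" ".join(ctx))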

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...

reply
See also speculative cascades which is a nice read and furthered my understanding of how it all works

https://research.google/blog/speculative-cascades-a-hybrid-a...

reply
Verification is faster than generation: one forward pass verifies multiple tokens, vs. a pass for every new token during generation.
reply
I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...
reply
They are referring to a thing called "speculative decoding" I think.
reply
When you predict with the small model, the big model can verify a whole batch of tokens at once, closer in speed to processing input tokens, provided the predictions are good and the work doesn't have to be redone.
reply
Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time as one API call to OpenAI.
reply
> Certainly interesting for very low latency applications which need < 10k tokens context.

I'm really curious whether context limits will matter much when using methods like Recursive Language Models[0]. That method breaks a huge context down recursively across smaller subagents, each working on a symbolic subset of the prompt.

The challenge with RLM seemed to be that it burns through a ton of tokens in exchange for more accuracy. If tokens are cheap, RLM could be beneficial here, providing much more accuracy over large contexts regardless of what the underlying model can natively handle.
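
A crude sketch of the recursive decomposition idea as I read it, with llm_call as a made-up stub for whatever model you would actually use:

  def llm_call(prompt: str, context: str) -> str:
      # Stand-in for a real model call; pretend it returns a short answer/summary.
      return context[:200]

  def recursive_answer(prompt: str, context: str, max_chars: int = 2000) -> str:
      if len(context) <= max_chars:
          return llm_call(prompt, context)          # base case: fits in one call
      # Split the oversized context and hand each half to its own sub-call.
      mid = len(context) // 2
      left = recursive_answer(prompt, context[:mid], max_chars)
      right = recursive_answer(prompt, context[mid:], max_chars)
      # Recurse over the combined intermediate results until they fit.
      return recursive_answer(prompt, left + "\n" + right, max_chars)

  print(recursive_answer("Summarize this.", "lorem ipsum " * 5000))

The trade-off the comment describes is visible here: the number of sub-calls grows with the context size, which is exactly where cheap, fast tokens would help.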

0. https://arxiv.org/abs/2512.24601

reply
At $20 a die, they could sell Game Boy-style cartridges for different models.
reply
That would be very cool: get an upgraded model every couple of months. Maybe a PCIe form factor.
reply
Yes, and even holding a couple of cartridges for different scenarios, e.g. image generation, coding, TTS/STT, etc.
reply
Make them shaped like floppy disks to confuse the younger generations.
reply
> 880mm^2 die

That's a lot of surface, isn't it? As big as an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).

> The larger the die size, the lower the yield.

I wonder if that applies? What's the big deal if a few parameters have a few bit flips?

reply
> I wonder if that applies? What's the big deal if a few parameters have a few bit flips?

We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.

Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe

reply
Also see Adrian Thompson's Xilinx 6200 FPGA, programmed by a genetic algorithm that worked but exploited nuances unique to that specific physical chip, meaning the software couldn't be copied to another chip. https://news.ycombinator.com/item?id=43152877
reply
I love that story.
reply
2000s movie line territory:

> There have always been ghosts in the machine. Random segments of code, that have grouped together to form unexpected protocols.

reply
This is where we go towards really smart robots. It is interesting what kinds of different model chips they can produce.
reply
There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.
reply
If LLMs just regurgitated compressed text, they'd fail on any novel problem not in their training data. Yet they routinely solve them, which means whatever's happening between input and output is more than retrieval, and calling it "not understanding" requires you to define understanding in a way that conveniently excludes everything except biological brains.
reply
Yes, there are some fascinating emergent properties at play, but when they fail it's blatantly obvious that there's no actual intelligence or understanding. They are very cool and very useful tools; I use them on a daily basis now, and the way I can just paste a vague screenshot with some vague text and they get it and give a useful response blows my mind every time. But it's very clear that it's all just smoke and mirrors; they're not intelligent and you can't trust them with anything.
reply
When humans fail a task, it’s obvious there is no actual intelligence nor understanding.

Intelligence is not as cool as you think it is.

reply
I assure you, intelligence is very cool.
reply
They don't solve novel problems. But if you have such a strong belief, please give us examples.
reply
Depends how precisely you define novel - I don't think LLMs are yet capable of posing and solving interesting problems, but they have been used to address known problems, and in doing so have contributed novel work. Examples include Erdos Problem #728[0] (Terence Tao said it was solved "more or less autonomously" by an LLM), IMO problems (Deepmind, OpenAI and Huang 2025), GPT-5.2 Pro contributing a conjecture in particle physics[1], systems like AlphaEvolve leveraging LLMs + evolutionary algorithms to generate new, faster algorithms for certain problems[2].

[0] https://mathstodon.xyz/@tao/115855840223258103

[1] https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons

[2] https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...

reply
> they'd fail on any novel problem not in their training data

Yes, and that's exactly what they do.

No, none of the problems you gave to the LLM while toying around with them are in any way novel.

reply
None of my codebases are in their training data, yet they routinely contribute to them in meaningful ways. They write code that I'm happy with that improves the codebases I work in.

Do you not consider that novel problem solving?

reply
We know that, but that does not make them useless. The opposite, in fact: they are extremely useful in the hands of non-idiots. We just happen to have an oversupply of idiots at the moment, which AI is here to eradicate. /Sort of satire.
reply
So you are saying they are like copy: LLMs will copy some training data back to you? Why do we spend so much money training and running them if they "just regurgitate text compressed in their memory based on probability"? Billions of dollars to build a lossy grep.

I think you are confused about LLMs: they take in context, and that context makes them generate new things; for existing things we have cp. By your logic, pianos can't be creative instruments because they just produce the same 88 notes.

reply
That's not how they work. Pro tip: maybe don't comment until you have a good understanding?
reply
Would you mind rectifying the wrong parts then?
reply
Phrases like "actual understanding", "true intelligence" etc. are not conducive to productive discussion unless you take the trouble to define what you mean by them (which ~nobody ever does). They're highly ambiguous and it's never clear what specific claims they do or don't imply when used by any given person.

But I think this specific claim is clearly wrong, if taken at face value:

> They just regurgitate text compressed in their memory

They're clearly capable of producing novel utterances, so they can't just be doing that. (Unless we're dealing with a very loose definition of "regurgitate", in which case it's probably best to use a different word if we want to understand each other.)

reply
The fact that the outputs are probabilities is not important. What is important is how that output is computed.

You could imagine that it is possible to learn certain algorithms/heuristics that "intelligence" is composed of, no matter what you output. Training for optimal compression of tasks and action-taking could lead to intelligence being the best solution.

This is far from a formal argument, but so is the stubborn reiteration of "it's just probabilities" or "it's just compression". Because this "just" thing is getting more and more capable of solving tasks that are surely not in the training data in exactly this form.

reply
Huh? Their words are an accurate, if simplified, description of how they work.
reply
Just HI slop. Ask any decent model; it can explain what's wrong with this description.
reply
Don’t forget that the 8B model requires 10 of said chips to run.

And it's a 3-bit quant, so about a 3GB memory requirement.

If they ran the 8B at native 16-bit precision, it would take 60 H100-sized chips.
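
For reference, the raw weight-storage arithmetic behind those numbers (just the parameter-count math, nothing about how Taalas actually partitions chips):

  params = 8e9

  def weight_gigabytes(bits_per_param: float) -> float:
      # parameters x bits per parameter, converted to gigabytes
      return params * bits_per_param / 8 / 1e9

  print(f"3-bit : {weight_gigabytes(3):.1f} GB")    # ~3 GB
  print(f"16-bit: {weight_gigabytes(16):.1f} GB")   # ~16 GB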

reply
> Don’t forget that the 8B model requires 10 of said chips to run.

Are you sure about that? If true it would definitely make it look a lot less interesting.

reply
Their 2.4 kW is for 10 chips, it seems, based on The Next Platform article.

I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.

https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

reply
It doesn't make any sense to think you need the whole server to run one model. It's much more likely that each server runs 10 instances of the model.

1. It doesn't make sense in terms of architecture. It's one chip. You can't split one model over 10 identical hardwired chips.

2. It doesn't add up with their claims of better power efficiency. 2.4kW for one model would be really bad.

reply
We are both wrong.

First, it is likely one chip for Llama 8B q3 with 1k context size. This could fit into around 3GB of SRAM, which is about the theoretical maximum at the TSMC N6 reticle limit.

Second, their plan is to etch larger models across multiple connected chips. It’s physically impossible to run bigger models otherwise since 3GB SRAM is about the max you can have on an 850mm2 chip.

  followed by a frontier-class large language model running inference across a collection of HC cards by year-end under its HC2 architecture
https://mlq.ai/news/taalas-secures-169m-funding-to-develop-a...
reply
Aren't they only using the SRAM for the KV cache? They mention that the hardwired weights have a very high density. They say about the ROM part:

> We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a single transistor. So the density is basically insane.

I'm not a hardware guy but they seem to be making a strong distinction between the techniques they're using for the weights vs KV cache

> In the current generation, our density is 8 billion parameters on the hard wired part of the chip, plus the SRAM to allow us to do KV caches, adaptations like fine tuning, and etc.
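
To get a feel for what the SRAM side has to hold, here's a rough KV-cache sizing sketch. I'm assuming Llama 3.1 8B's published config (32 layers, 8 KV heads, head_dim 128) and fp16 cache entries; those numbers are my assumption, not anything from the article:

  layers, kv_heads, head_dim = 32, 8, 128
  bytes_per_value = 2                          # fp16/bf16 cache entries

  def kv_cache_bytes(context_len: int) -> int:
      # K and V, per layer, per KV head, per head_dim element, per token
      return 2 * layers * kv_heads * head_dim * bytes_per_value * context_len

  for ctx in (1024, 8192):
      print(f"{ctx:>5} tokens: {kv_cache_bytes(ctx) / 1e6:.0f} MB")
  # ~134 MB at 1k context, ~1074 MB at 8k: small next to the weights, which is
  # consistent with the weights living in ROM and only the cache (plus
  # adapters) needing SRAM.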

reply
Thanks for having a brain.

Not sure who started that "split into 10 chips" claim, it's just dumb.

This is Llama 3.1 8B hardcoded (literally) on one chip. That's what the startup is about; they emphasize this multiple times.

reply
It’s just dumb to think that one chip per model is their plan. They stated that their plan is to chain multiple chips together.

I was indeed wrong about 10 chips. I thought they would use Llama 8B at 16-bit and a few thousand tokens of context. It turns out they used Llama 8B at 3-bit with around 1k context size. That made me assume they must have chained multiple chips together, since the max SRAM on TSMC N6 for a reticle-sized chip is only around 3GB.

reply
Hardware decoders make sense for fixed codecs like MPEG, but I can't see it making sense for small models that improve every 6 months.
reply
There's a bit of a hidden cost here… the longevity of GPU hardware is going to be greater; it's extended every time there's an algorithmic improvement. Whereas any efficiency gains in software that are not compatible with this hardware will tend to accelerate its depreciation.
reply
Do not overlook traditional irrational investor exuberance; we've got an abundance of that right now. With the right PR maneuvers these guys could be a tulip craze.
reply
Yeah, it's fast af, but in my own tests with large chunks of text it very quickly loses context / hallucinates.
reply
This is insane if true - it could be super useful for data extraction tasks. Sounds like we could be talking in the cents-per-million-tokens range.
reply
Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.
reply
Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.
reply
This is not voice-to-voice though; end-to-end voice chat models (the "Her" UX) are completely different.
reply
I haven't found any end-to-end voice chat models useful. I had much better results with separate STT-LLM-TTS. One big problem is turn detection, and having inference with 150-200ms latency would allow for a whole new level of quality. I would just use it with a prompt like "Do you think the user is finished talking?" and then push the turn to a larger model. The AI should reply within the ballpark of 600-1000ms. Faster is often irritating; slower will make the user start talking again.
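
Roughly the loop I mean, with both model calls stubbed out (the function names are made up; the 150-200ms and 600-1000ms figures above are the budgets the real calls would have to hit):

  def fast_llm_turn_done(transcript: str) -> bool:
      # Stand-in for the small ~150-200ms model asked:
      # "Do you think the user is finished talking?"
      return transcript.rstrip().endswith((".", "?", "!"))

  def big_llm_reply(transcript: str) -> str:
      # Stand-in for the larger model that should answer within ~600-1000ms.
      return f"(reply to: {transcript!r})"

  def on_partial_transcript(transcript: str) -> str | None:
      if not fast_llm_turn_done(transcript):
          return None                   # keep listening, don't interrupt
      return big_llm_reply(transcript)

  print(on_partial_transcript("So what's the weather like"))        # None
  print(on_partial_transcript("So what's the weather like today?"))
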
reply
I think it's really useful for agent-to-agent communication, as long as context loading doesn't become a bottleneck. Right now there can be noticeable delays under the hood, but at these speeds we'll never have to worry about latency when chain-calling hundreds or thousands of agents in a network (I'm presuming this is going to take off in the future). Correct me if I'm wrong though.
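
Back-of-the-envelope on that, using the 16k tokens/s figure from the thread and made-up per-hop numbers (200 output tokens and ~5 ms to first token per agent are purely illustrative):

  tokens_per_s = 16_000
  hops = 1_000                    # sequential agent-to-agent calls
  tokens_per_hop = 200            # assumed output length per agent
  ttft_s = 0.005                  # assumed ~5 ms time to first token per hop

  total_s = hops * (ttft_s + tokens_per_hop / tokens_per_s)
  print(f"{hops} sequential hops: ~{total_s:.1f} s")   # ~17.5 s end to end
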
reply
Sounds perfect for use in consumer devices.
reply
Doesn't the blog state that it's now 4bit (the first gen was 3bit + 6bit)?
reply
An on-device reasoning model with that kind of speed and cost would completely change the way people use their computers. It would be closer to Star Trek than anything else we've ever had. You'd never have to type anything or use a mouse again.
reply