The path to ubiquitous AI (17k tokens/sec)

upvote

The path to ubiquitous AI (17k tokens/sec)

(taalas.com)

633 points

by sidnarsipur11 hours ago |

upvote

by dust4210 hours ago|

[-]

This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.

Tech summary:

  - 15k tok/sec on 8B dense 3bit quant (llama 3.1) 
  - limited KV cache
  - 880mm^2 die, TSMC 6nm, 53B transistors
  - presumably 200W per chip
  - 20x cheaper to produce
  - 10x less energy per token for inference
  - max context size: flexible
  - mid-sized thinking model upcoming this spring on same hardware
  - next hardware supposed to be FP4 
  - a frontier LLM planned within twelve months

This is all from their website, I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, $200M VC so far.

Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.

Not exactly a competitor for Nvidia but probably for 5-10% of the market.

Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.

Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

reply

upvote

by vessenes10 hours ago|

[-]

This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html Has H200 doing 12k tokens/second on llama 2 12b fb8. Knowing these architectures that’s likely a 100+ ish batched run, meaning time to first token is almost certainly slower than taalas. Probably much slower, since Taalas is like milliseconds.

3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.

However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.

Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

reply

upvote

by rbanffy8 hours ago|

[-]

> any factor of 10 being a new science / new product category,

I often remind people two orders of quantitative change is a qualitative change.

> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.

While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.

I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.

reply

upvote

by ttul6 hours ago|

[-]

Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.

reply

upvote

by vessenes1 hours ago|

[-]

Yeah and a team I’m not familiar with — I didn’t check bios but they don’t lead with ‘our team made this or that gpu for this or that bigco’.

The design ip at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at tsmc. Or they’ve been waiting a year for a slot :)

reply

upvote

by dust421 hours ago|

[-]

From the article:

"Ljubisa Bajic desiged video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as s senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent."

His wife (COO) worked at Altera, ATI, AMD and Testorrent.

"Drago Ignjatovic, who was a senior design engineer working on AMD APUs and GPUs and took over for Ljubisa Bajic as director of ASIC design when the latter left to start Tenstorrent. Nine months later, Ignjatovic joined Tenstorrent as its vice president of hardware engineering, and he started Taalas with the Bajices as the startup’s chief technology officer."

Not a youngster gang...

reply

upvote

by VagabundoP7 hours ago|

[-]

There might be a foodchain of lower order uses when they become "obsolete".

reply

upvote

by rbanffy5 hours ago|

[-]

I think there will be a lot of space for sensorial models in robotics, as the laws of physics don't change much, and a light switch or automobile controls have remained stable and consistent over the last decades.

reply

upvote

by Gareth3219 hours ago|

[-]

I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.

reply

upvote

by nylonstrung9 hours ago|

[-]

Agree that routing is becoming the critical layer here. Vllm iris is really promising for this https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

There's already some good work on router benchmarking which is pretty interesting

reply

upvote

by condiment6 hours ago|

[-]

At 16k tokens/s why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.

Abundance supports different strategies. One approach: Set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. You know a priori which models have the highest quality in aggregate. Pick that one.

reply

upvote

by IanCal5 hours ago|

[-]

The best coding model won’t be the best roleplay one which won’t be the best at tool use. It depends what you want to do in order to pick the best model.

reply

upvote

by PhunkyPhil4 hours ago|

[-]

I'm not saying you're wrong, but why is this the case?

I'm out of the loop on training LLMs, but to me it's just pure data input. Are they choosing to include more code rather than, say fiction books?

reply

upvote

by jmalicki2 hours ago|

[-]

There is the pre-training, where you passively read stuff from the web.

From there you go to RL training, where humans are grading model responses, or the AI is writing code to try to pass tests and learning how to get the tests to pass, etc. The RL phase is pretty important because it's not passive, and it can focus on the weaker areas of the model too, so you can actually train on a larger dataset than the sum of recorded human knowledge.

reply

upvote

by refulgentis4 hours ago|

[-]

I’ll go ahead and say they’re wrong (source: building and maintaining llm client with llama.cpp integrated & 40+ 3p models via http)

I desperately want there to be differentiation. Reality has shown over and over again it doesn’t matter. Even if you do same query across X models and then some form of consensus, the improvements on benchmarks are marginal and UX is worse (more time, more expensive, final answer is muddied and bound by the quality of the best model)

reply

upvote

by monooso8 hours ago|

[-]

I came across this yesterday. Haven't tried it, but it looks interesting:

https://agent-relay.com/

reply

upvote

by eshaham789 hours ago|

[-]

[dead]

reply

upvote

by btown9 hours ago|

[-]

For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?

reply

upvote

by Zetaphor9 hours ago|

[-]

My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller models representation of the larger models weights in order to attempt to accurately predict its token output. This wouldn't work cross-model as the token probabilities are completely different.

reply

upvote

by jasonjmcghee7 hours ago|

[-]

This is not correct.

Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

It's a balance as you want it to be guessing correctly as much as possible but also be as fast as possible. Validation takes time and every guess needs to be validated etc

The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.

reply

upvote

by ashirviskas8 hours ago|

[-]

Smaller quant or smaller model?

Afaik it can work with anything, but sharing vocab solves a lot of headaches and the better token probs match, the more efficient it gets.

Which is why it is usually done with same family models and most often NOT just different quantizations of the same model.

reply

upvote

by vessenes8 hours ago|

[-]

I think they’d commission a quant directly. Benefits go down a lot when you leave model families.

reply

upvote

by joha427010 hours ago|

[-]

The guts of a LLM isn't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?

reply

upvote

by connorbrinton9 hours ago|

[-]

Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence of tokens from scratch, because validation can take more advantage of parallel processing. So the process is generate with small model -> validate with big model -> then generate with big model only if validation fails

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...

reply

upvote

by sails9 hours ago|

[-]

See also speculative cascades which is a nice read and furthered my understanding of how it all works

https://research.google/blog/speculative-cascades-a-hybrid-a...

reply

upvote

by speedping9 hours ago|

[-]

Verification is faster than generation, one forward pass for verification of multiple tokens vs a pass for every new token in generation

reply

upvote

by vanviegen9 hours ago|

[-]

I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...

reply

upvote

by ml_basics9 hours ago|

[-]

They are referring to a thing called "speculative decoding" I think.

reply

upvote

by cma9 hours ago|

[-]

When you predict with the small model, the big model can verify as more of a batch and be more similar in speed to processing input tokens, if the predictions are good and it doesn't have to be redone.

reply

upvote

by empath758 hours ago|

[-]

Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time of doing one api call to open ai.

reply

upvote

by xnx20 minutes ago|

[-]

Hardware decoders make sense for fixed codecs like MPEG, but I can't see it making sense for small models that improve every 6 months.

reply

upvote

by soleveloper9 hours ago|

[-]

In 20$ a die, they could sell Gameboy style cartridges for different models.

reply

upvote

by noveltyaccount7 hours ago|

[-]

That would be very cool, get an upgraded model every couple of months. Maybe PCIe form factor.

reply

upvote

by soleveloper1 hours ago|

[-]

Yes, and even holding couple of cartridges for different scenarios e.g image generation, coding, tts/stt, etc

reply

upvote

by pennomi6 hours ago|

[-]

Make them shaped like floppy disks to confuse the younger generations.

reply

upvote

by Aissen9 hours ago|

[-]

> 880mm^2 die

That's a lot of surface, isn't it? As big an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).

> The larger the die size, the lower the yield.

I wonder if that applies? What's the big deal if a few parameter have a few bit flips?

reply

upvote

by rbanffy9 hours ago|

[-]

> I wonder if that applies? What's the big deal if a few parameter have a few bit flips?

We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.

Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe

reply

upvote

by sowbug7 hours ago|

[-]

Also see Adrian Thompson's Xilinx 6200 FPGA, programmed by a genetic algorithm that worked but exploited nuances unique to that specific physical chip, meaning the software couldn't be copied to another chip. https://news.ycombinator.com/item?id=43152877

reply

upvote

by rbanffy5 hours ago|

[-]

I love that story.

reply

upvote

by philipwhiuk9 hours ago|

[-]

2000s movie line territory:

> There have always been ghosts in the machine. Random segments of code, that have grouped together to form unexpected protocols.

reply

upvote

by elternal_love10 hours ago|

[-]

Were we go towards really smart roboters. It is interesting what kind of diferent model chips they can produce.

reply

upvote

by varispeed10 hours ago|

[-]

There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.

reply

upvote

by adamtaylor_138 hours ago|

[-]

If LLMs just regurgitate compressed text, they'd fail on any novel problem not in their training data. Yet, they routinely solve them, which means whatever's happening between input and output is more than retrieval, and calling it "not understanding" requires you to define understanding in a way that conveniently excludes everything except biological brains.

reply

upvote

by sfn426 hours ago|

[-]

Yes there are some fascinating emergent properties at play, but when they fail it's blatantly obvious that there's no actual intelligence nor understanding. They are very cool and very useful tools, I use them on a daily basis now and the way I can just paste a vague screenshot with some vague text and they get it and give a useful response blows my mind every time. But it's very clear that it's all just smoke and mirrors, they're not intelligent and you can't trust them with anything.

reply

upvote

by pennomi6 hours ago|

[-]

When humans fail a task, it’s obvious there is no actual intelligence nor understanding.

Intelligence is not as cool as you think it is.

reply

upvote

by sfn426 hours ago|

[-]

I assure you, intelligence is very cool.

reply

upvote

by varispeed8 hours ago|

[-]

They don't solve novel problems. But if you have such strong belief, please give us examples.

reply

upvote

by ainch2 hours ago|

[-]

Depends how precisely you define novel - I don't think LLMs are yet capable of posing and solving interesting problems, but they have been used to address known problems, and in doing so have contributed novel work. Examples include Erdos Problem #728[0] (Terence Tao said it was solved "more or less autonomously" by an LLM), IMO problems (Deepmind, OpenAI and Huang 2025), GPT-5.2 Pro contributing a conjecture in particle physics[1], systems like AlphaEvolve leveraging LLMs + evolutionary algorithms to generate new, faster algorithms for certain problems[2].

[0] https://mathstodon.xyz/@tao/115855840223258103

[1] https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons

[2] https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...

reply

upvote

by otabdeveloper42 hours ago|

[-]

> they'd fail on any novel problem not in their training data

Yes, and that's exactly what they do.

No, none of the problems you gave to the LLM while toying around with them are in any way novel.

reply

upvote

by adamtaylor_131 hours ago|

[-]

None of my codebases are in their training data, yet they routinely contribute to them in meaningful ways. They write code that I'm happy with that improves the codebases I work in.

Do you not consider that novel problem solving?

reply

upvote

by bsenftner9 hours ago|

[-]

We know that, but that does not make them unuseful. The opposite in fact, they are extremely useful in the hands of non-idiots.We just happen to have a oversupply of idiots at the moment, which AI is here to eradicate. /Sort of satire.

reply

upvote

by visarga8 hours ago|

[-]

So you are saying they are like copy, LLMs will copy some training data back to you? Why do we spend so much money training and running them if they "just regurgitate text compressed in their memory based on probability"? billions of dollars to build a lossy grep.

I think you are confused about LLMs - they take in context, and that context makes them generate new things, for existing things we have cp. By your logic pianos can't be creative instruments because they just produce the same 88 notes.

reply

upvote

by small_model10 hours ago|

[-]

Thats not how they work, pro-tip maybe don't comment until you have a good understanding?

reply

upvote

by fyltr10 hours ago|

[-]

Would you mind rectifying the wrong parts then?

reply

upvote

by retsibsi9 hours ago|

[-]

Phrases like "actual understanding", "true intelligence" etc. are not conducive to productive discussion unless you take the trouble to define what you mean by them (which ~nobody ever does). They're highly ambiguous and it's never clear what specific claims they do or don't imply when used by any given person.

But I think this specific claim is clearly wrong, if taken at face value:

> They just regurgitate text compressed in their memory

They're clearly capable of producing novel utterances, so they can't just be doing that. (Unless we're dealing with a very loose definition of "regurgitate", in which case it's probably best to use a different word if we want to understand each other.)

reply

upvote

by mhl479 hours ago|

[-]

The fact that the outputs are probabilities is not important. What is important is how that output is computed.

You could imagine that it is possible to learn certain algorithms/ heuristics that "intelligence" is comprised of. No matter what you output. Training for optimal compression of tasks /taking actions -> could lead to intelligence being the best solution.

This is far from a formal argument but so is the stubborn reiteration off "it's just probabilities" or "it's just compression". Because this "just" thing is getting more an more capable of solving tasks that are surely not in the training data exactly like this.

reply

upvote

by 1007219 hours ago|

[-]

Huh? Their words are an accurate, if simplified, description of how they work.

reply

upvote

by beyondCritics9 hours ago|

[-]

Just HI slop. Ask any decent model, it can explain what's wrong this this description.

reply

upvote

by aurareturn10 hours ago|

[-]

Don’t forget that the 8B model requires 10 of said chips to run.

And it’s a 3bit quant. So 3GB ram requirement.

If they run 8B using native 16bit quant, it will use 60 H100 sized chips.

reply

upvote

by dust4210 hours ago|

[-]

> Don’t forget that the 8B model requires 10 of said chips to run.

Are you sure about that? If true it would definitely make it look a lot less interesting.

reply

upvote

by aurareturn10 hours ago|

[-]

Their 2.4 kW is for 10 chips it seems based on the next platform article.

I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.

https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

reply

upvote

by audunw10 hours ago|

[-]

It doesn’t make any sense to think you need the whole server to run one model. It’s much more likely that each server runs 10 instances of the model

1. It doesn’t make sense in terms of architecture. It’s one chip. You can’t split one model over 10 identical hardwire chips

2. It doesn’t add up with their claims of better power efficiency. 2.4kW for one model would be really bad.

reply

upvote

by aurareturn9 hours ago|

[-]

We are both wrong.

First, it is likely one chip for llama 8B q3 with 1k context size. This could fit into around 3GB of SRAM which is about the theoretical maximum for TSMC N6 reticle limit.

Second, their plan is to etch larger models across multiple connected chips. It’s physically impossible to run bigger models otherwise since 3GB SRAM is about the max you can have on an 850mm2 chip.

  followed by a frontier-class large language model running inference across a collection of HC cards by year-end under its HC2 architecture

https://mlq.ai/news/taalas-secures-169m-funding-to-develop-a...

reply

upvote

by pigpop10 minutes ago|

[-]

Aren't they only using the SRAM for the KV cache? They mention that the hardwired weights have a very high density. They say about the ROM part:

> We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a single transistor. So the density is basically insane.

I'm not a hardware guy but they seem to be making a strong distinction between the techniques they're using for the weights vs KV cache

> In the current generation, our density is 8 billion parameters on the hard wired part of the chip., plus the SRAM to allow us to do KV caches, adaptations like fine tuning, and etc.

reply

upvote

by moralestapia10 hours ago|

[-]

Thanks for having a brain.

Not sure who started that "split into 10 chips" claim, it's just dumb.

This is Llama 3B hardcoded (literally) on one chip. That's what the startup is about, they emphasize this multiple times.

reply

upvote

by aurareturn9 hours ago|

[-]

It’s just dumb to think that one chip per model is their plan. They stated that their plan is to chain multiple chips together.

I was indeed wrong about 10 chips. I thought they would use llama 8B 16bit and a few thousand context size. It turns out, they used llama 8B 3bit with around 1k context size. That made me assume they must have chained multiple chips together since the max SRAM on TSMC n6 for reticle sized chip is only around 3GB.

reply

upvote

by WhitneyLand8 hours ago|

[-]

There’s a bit of a hidden cost here… the longevity of GPU hardware is going to be longer, it’s extended every time there’s an algorithmic improvement. Whereas any efficiency gains in software that are not compatible with this hardware will tend to accelerate their depreciation.

reply

upvote

by bsenftner9 hours ago|

[-]

Do not overlook traditional irrational investor exuberance, we've got an abundance of that right now. With the right PR manouveurs these guys could be a tulip craze.

reply

upvote

by mikhail-ramirez7 hours ago|

[-]

Yea its fast af but very quickly loses context/hallucinates from my own tests with large chunks of text

reply

upvote

by robotnikman4 hours ago|

[-]

Sounds perfect for use in consumer devices.

reply

upvote

by oliwary10 hours ago|

[-]

This is insane if true - could be super useful for data extraction tasks. Sounds like we could be talking in the cents per millions of tokens range.

reply

upvote

by zozbot23410 hours ago|

[-]

Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.

reply

upvote

by dust4210 hours ago|

[-]

Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.

reply

upvote

by zozbot23410 hours ago|

[-]

This is not voice-to-voice though, end-to-end voice chat models (the Her UX) are completely different.

reply

upvote

by dust429 hours ago|

[-]

I haven't found any end-to-end voice chat models useful. I had much better results with separate STT-LLM-TTS. One big problem is the turn detection and having inference with 150-200ms latency would allow for a whole new level of quality. I would just use it with a prompt: "You think the user is finished talking?" and then push it to a larger model. The AI should reply within the ballpark of 600ms-1000ms. Faster is often irritating, slower will make the user to start talking again.

reply

upvote

by PhunkyPhil4 hours ago|

[-]

I think it's really useful for agent to agent communication, as long as context loading doesn't become a bottleneck. Right now there can be noticeable delays under the hood, but at these speeds we'll never have to worry about latency when chain calling hundreds or thousands of agents in a network (I'm presuming this is going to take off in the future). Correct me if I'm wrong though.

reply

upvote

by Tepix9 hours ago|

[-]

Doesn't the blog state that it's now 4bit (the first gen was 3bit + 6bit)?

reply

upvote

by empath758 hours ago|

[-]

An on-device reasoning model what that kind of speed and cost would completely change the way people use their computers. It would be closer to star trek than anything else we've ever had. You'd never have to type anything or use a mouse again.

reply

upvote

by Alifatisk5 hours ago|

[-]

What's happening in the comment section? How come so many cannot understand that his is running Llama 3.1 8B? Why are people judging its accuracy? It's almost a 2 years old 8B param model, why are people expecting to see Opus level response!?

The focus here should be on the custom hardware they are producing and its performance, that is whats impressive. Imagine putting GLM-5 on this, that'd be insane.

This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs, they are creating something called a dLLM which is like a diffusion based llm. The speed is still impressive when playing aroun with it sometimes. But this, this is something else, it's almost unbelievable. As soon as I hit the enter key, the response appears, it feels instant.

I am also curious about Taalas pricing.

> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

Do we have an idea of how much a unit / inference / api will cost?

Also, considering how fast people switch models to keep up with the pace. Is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw the current hardware and buy another one? Shouldn't there be a more flexible way? Maybe only having to switch the chip on top like how people upgrade CPUs. I don't know, just thinking out loudly.

reply

upvote

by mike_hearn4 hours ago|

[-]

They don't give cost figures in their blog post but they do here:

https://www.nextplatform.com/wp-content/uploads/2026/02/taal...

Probably they don't know what the market will bear and want to do some exploratory pricing, hence the "contact us" API access form. That's fair enough. But they're claiming orders of magnitude cost reduction.

> Is there really a potential market for hardware designed for one model only?

I'm sure there is. Models are largely interchangeable especially as the low end. There are lots of use cases where you don't need super smart models but cheapness and fastness can matter a lot.

Think about a simple use case: a company has a list of one million customer names but no information about gender or age. They'd like to get a rough understanding of this. Mapping name -> guessed gender, rough guess of age is a simple problem for even dumb LLMs. I just tried it on ChatJimmy and it worked fine. For this kind of exploratory data problem you really benefit from mass parallelism, low cost and low latency.

> Shouldn't there be a more flexible way?

The whole point of their design is to sacrifice flexibility for speed, although they claim they support fine tunes via LoRAs. LLMs are already supremely flexible so it probably doesn't matter.

reply

upvote

by pigpop1 minutes ago|

[-]

Yes, there are all kinds of fuzzy NLP tasks that this would be great for. Jobs where you can chunk the text into small units and add instructions and only need a short response. You could burn through huge data sets very quickly using these chips.

reply

upvote

by himata41134 hours ago|

[-]

I personally don't buy it, cerebras is way more advanced than this, comparing this tok/s to cerebras is disingenious.

reply

upvote

by jameslk4 minutes ago|

[-]

The implications for RLM is really interesting. RLM is expensive because of token economics. But when tokens are so cheap and fast to generate, context size of the model matters a lot less

Also interesting implications for optimization-driven frameworks like DSPy. If you have an eval loop and useful reward function, you can iterate to the best possible response every time and ignore the cost of each attempt

reply

upvote

by mbh15924 minutes ago|

[-]

So cool, what's underappreciated imo: 17k tokens/sec doesn't just change deployment economics. It changes what evaluation means, static MMLU-style tests were designed around human-paced interaction. At this throughput you can run tens of thousands of adversarial agent interactions in the time a standard benchmark takes. Speed doesn't make static evals better it makes them even more obviously inadequate.

reply

upvote

by freakynit10 hours ago|

[-]

Holy cow their chatapp demo!!! I for first time thought i mistakenly pasted the answer. It was literally in a blink of an eye.!!

https://chatjimmy.ai/

reply

upvote

by qingcharles7 hours ago|

[-]

I asked it to design a submarine for my cat and literally the instant my finger touched return the answer was there. And that is factoring in the round-trip time for the data too. Crazy.

The answer wasn't dumb like others are getting. It was pretty comprehensive and useful.

  While the idea of a feline submarine is adorable, please be aware that building a real submarine requires significant expertise, specialized equipment, and resources.

reply

upvote

by robotpepi4 hours ago|

[-]

it's incredible how many people are commenting here without having read the article. they completely lost the point.

reply

upvote

by smusamashah8 hours ago|

[-]

With this speed, you can keep looping and generating code until it passes all tests. If you have tests.

Generate lots of solutions and mix and match. This allows a new way to look at LLMs.

reply

upvote

by Retr0id8 hours ago|

[-]

Not just looping, you could do a parallel graph search of the solution-space until you hit one that works.

reply

upvote

by xi_studio7 hours ago|

[-]

Infinite Monkey Theory just reached its peak

reply

upvote

by dave1010uk3 hours ago|

[-]

You could also parse prompts into an AST, run inference, run evals, then optimise the prompts with something like a genetic algorithm.

reply

upvote

by Epskampie8 hours ago|

[-]

And then it's slow again to finally find a correct answer...

reply

upvote

by 346795 hours ago|

[-]

It won't find the correct answer. Garbage in, garbage out.

reply

upvote

by otabdeveloper42 hours ago|

[-]

This doesn't work. The model outputs the most probable tokens. Running it again and asking for less probable tokens just results in the same but with more errors.

reply

upvote

by therealdrag014 minutes ago|

[-]

Do you not have experience with agents solving problems? They already successfully do this. They try different things until they get a solution.

reply

upvote

by turnsout5 hours ago|

[-]

Agreed, this is exciting, and has me thinking about completely different orchestrator patterns. You could begin to approach the solution space much more like a traditional optimization strategy such as CMA-ES. Rather than expect the first answer to be correct, you diverge wildly before converging.

reply

upvote

by MattRix8 hours ago|

[-]

This is what people already do with “ralph” loops using the top coding models. It’s slow relative to this, but still very fast compared to hand-coding.

reply

upvote

by amelius9 hours ago|

[-]

OK investors, time to pull out of OpenAI and move all your money to ChatJimmy.

reply

upvote

by freakynit9 hours ago|

[-]

A related argument I raised a few days back on HN:

What's the moat with with these giant data-centers that are being built with 100's of billions of dollars on nvidia chips?

If such chips can be built so easily, and offer this insane level of performance at 10x efficiency, then one thing is 100% sure: more such startups are coming... and with that, an entire new ecosystem.

reply

upvote

by jzymbaluk5 hours ago|

[-]

You'd still need those giant data centers for training new frontier models. These Taalas chips, if they work, seem to do the job of inference well, but training will still require general purpose GPU compute

reply

upvote

by bonoboTP1 hours ago|

[-]

Next up: wire up a specialized chip to run the training loop of a specific architecture.

reply

upvote

by codebje9 hours ago|

[-]

RAM hoarding is, AFAICT, the moat.

reply

upvote

by freakynit9 hours ago|

[-]

lol... true that for now though

reply

upvote

by Windchaser5 hours ago|

[-]

Yeah, just cause Cisco had a huge market lead on telecom in the late '90s, it doesn't mean they kept it.

(And people nowadays: "Who's Cisco?")

reply

upvote

by bee_rider8 hours ago|

[-]

I think their hope is that they’ll have the “brand name” and expertise to have a good head start when real inference hardware comes out. It does seem very strange, though, to have all these massive infrastructure investment on what is ultimately going to be useless prototyping hardware.

reply

upvote

by elictronic7 hours ago|

[-]

Tools like openclaw start making the models a commodity.

I need some smarts to route my question to the correct model. I wont care which that is. Selling commodities is notorious for slow and steady growth.

reply

upvote

by wmf2 hours ago|

[-]

Nvidia bought all the capacity so their competitors can't be manufactured at scale.

reply

upvote

by mlboss4 hours ago|

[-]

If I am not mistaken this chip was build specifically for the llama 8b model. Nvidia chips are general purpose.

reply

upvote

by raincole9 hours ago|

[-]

You mean Nvidia?

reply

upvote

by gwd9 hours ago|

[-]

I dunno, it pretty quickly got stuck; the "attach file" didn't seem to work, and when I asked "can you see the attachment" it replied to my first message rather than my question.

reply

upvote

by scosman9 hours ago|

[-]

It’s llama 3.1 8B. No vision, not smart. It’s just a technical demo.

reply

upvote

by anthonypasq7 hours ago|

[-]

why is everyone seemingly incapable of understanding this? waht is going on here? Its like ai doomers consistently have the foresight of a rat. yeah no shit it sucks its running llama 3 8b, but theyre completely incapable of extrapolation.

reply

upvote

by freakynit9 hours ago|

[-]

Hmm.. I had tried simple chat converation without file attachments.

reply

upvote

by zwaps9 hours ago|

[-]

I got 16.000 tokens per second ahaha

reply

upvote

by bsenftner9 hours ago|

[-]

I get nothing, no replies to anything.

reply

upvote

by freakynit9 hours ago|

[-]

Maybe hn and reddit crowd have overloaded them lol

reply

upvote

by elliotbnvl9 hours ago|

[-]

That… what…

reply

upvote

by PlatoIsADisease8 hours ago|

[-]

Well it got all 10 incorrect when I asked for top 10 catchphrases from a character in Plato's books. It confused the baddie for Socrates.

reply

upvote

by rvz8 hours ago|

[-]

Fast, but stupid.

   Me: "How many r's in strawberry?"

   Jimmy: There are 2 r's in "strawberry".

   Generated in 0.001s • 17,825 tok/s

The question is not about how fast it is. The real question(s) are:

   1. How is this worth it over diffusion LLMs (No mention of diffusion LLMs at all in this thread)

(This also assumes that diffusion LLMs will get faster)

   2. Will Talaas also work with reasoning models, especially those that are beyond 100B parameters and with the output being correct? 

   3. How long will it take to create newer models to be turned into silicon? (This industry moves faster than Talaas.)

   4. How does this work when one needs to fine-tune the model, but still benefit from the speed advantages?

reply

upvote

by mike_hearn4 hours ago|

[-]

The blog answers all those questions. It says they're working on fabbing a reasoning model this summer. It also says how long they think they need to fab new models, and that the chips support LoRAs and tweaking context window size.

I don't get these posts about ChatJimmy's intelligence. It's a heavily quantized Llama 3, using a custom quantization scheme because that was state of the art when they started. They claim they can update quickly (so I wonder why they didn't wait a few more months tbh and fab a newer model). Llama 3 wasn't very smart but so what, a lot of LLM use cases don't need smart, they need fast and cheap.

Also apparently they can run DeepSeek R1 also, and they have benchmarks for that. New models only require a couple of new masks so they're flexible.

reply

upvote

by simlevesque5 hours ago|

[-]

LLMs can't count. They need tool use to answer these questions accurately.

reply

upvote

by refsys6 hours ago|

[-]

[dead]

reply

upvote

by b0ner_t0ner8 hours ago|

[-]

I asked, “What are the newest restaurants in New York City?”

Jimmy replied with, “2022 and 2023 openings:”

0_0

reply

upvote

by freakynit8 hours ago|

[-]

Well, technically it's answer is correct when you consider it's knowledge cutoff date... it just gave you a generic always right answer :)

reply

upvote

by xi_studio7 hours ago|

[-]

chatjimmy's trained on LLama 3.1

reply

upvote

by jvidalv8 hours ago|

[-]

Is super fast but also super inaccurate, I would say not even gpt-3 levels.

reply

upvote

by roywiggins4 hours ago|

[-]

That's because it's llama3 8b.

reply

upvote

by empath758 hours ago|

[-]

There are a lot of people here that are completely missing the point. What is it called where you look at a point of time and judge an idea without seemingly being able to imagine 5 seconds into the future.

reply

upvote

by Alifatisk5 hours ago|

[-]

“static evaluation”

reply

upvote

by Etheryte9 hours ago|

[-]

It is incredibly fast, on that I agree, but even simple queries I tried got very inaccurate answers. Which makes sense, it's essentially a trade off of how much time you give it to "think", but if it's fast to the point where it has no accuracy, I'm not sure I see the appeal.

reply

upvote

by andrewdea8 hours ago|

[-]

the hardwired model is Llama 3.1 8B, which is a lightweight model from two years ago. Unlike other models, it doesn't use "reasoning:" the time between question and answer is spent predicting the next tokens. It doesn't run faster because it uses less time to "think," It runs faster because its weights are hardwired into the chip rather than loaded from memory. A larger model running on a larger hardwired chip would run about as fast and get far more accurate results. That's what this proof of concept shows

reply

upvote

by Etheryte8 hours ago|

[-]

I see, that's very cool, that's the context I was missing, thanks a lot for explaining.

reply

upvote

by kaashif9 hours ago|

[-]

If it's incredibly fast at a 2022 state of the art level of accuracy, then surely it's only a matter of time until it's incredibly fast at a 2026 level of accuracy.

reply

upvote

by PrimaryExplorer9 hours ago|

[-]

yeah this is mindblowing speed. imagine this with opus 4.6 or gpt 5.2. probably coming soon

reply

upvote

by scotty799 hours ago|

[-]

I'd be happy if they can run GLM 5 like that. It's amazing at coding.

reply

upvote

by Gud9 hours ago|

[-]

Why do you assume this?

I can produce total jibberish even faster, doesn’t mean I produce Einstein level thought if I slow down

reply

upvote

by Closi4 hours ago|

[-]

Better models already exist, this is just proving you can dramatically increase inference speeds / reduce inference costs.

It isn't about model capability - it's about inference hardware. Same smarts, faster.

reply

upvote

by andy12_8 hours ago|

[-]

Not what he said.

reply

upvote

by scotty799 hours ago|

[-]

I think it might be pretty good for translation. Especially when fed with small chunks of the content at a time so it doesn't lose track on longer texts.

reply

upvote

by alexc053 hours ago|

[-]

If I could have one of these cards in my own computer do you think it would be possible to replace claude code?

1. Assume It's running a better model, even a dedicated coding model. High scoring but obviously not opus 4.5 2. Instead of the standard send-receive paradigm we set up a pipeline of agents, each of whom parses the output of the previous.

At 17k/tps running locally, you could effectively spin up tasks like "you are an agent who adds semicolons to the end of the line in javascript", with some sort of dedicated software in the style of claude code you could load an array of 20 agents each with a role to play in improving outpus.

take user input and gather context from codebase -> rewrite what you think the human asked you in the form of an LLM-optimized instructional prompt -> examine the prompt for uncertainties and gaps in your understanding or ability to execute -> <assume more steps as relevant> -> execute the work

Could you effectively set up something that is configurable to the individual developer - a folder of system prompts that every request loops through?

Do you really need the best model if you can pass your responses through a medium tier model that engages in rapid self improvement 30 times in a row before your claude server has returned its first shot response?

reply

upvote

by AmazingTurtle2 hours ago|

[-]

Models can't improve themselves with their own (model) input, they need to be grounded in truth and reality.

reply

upvote

by jjcm9 hours ago|

[-]

A lot of naysayers in the comments, but there are so many uses for non-frontier models. The proof of this is in the openrouter activity graph for llama 3.1: https://openrouter.ai/meta-llama/llama-3.1-8b-instruct/activ...

10b daily tokens growing at an average of 22% every week.

There are plenty of times I look to groq for narrow domain responses - these smaller models are fantastic for that and there's often no need for something heavier. Getting the latency of reponses down means you can use LLM-assisted processing in a standard webpage load, not just for async processes. I'm really impressed by this, especially if this is its first showing.

reply

upvote

by jtr14 hours ago|

[-]

Maybe this is a naive question, but why wouldn't there be market for this even for frontier models? If Anthropic wanted to burn Opus 4.6 into a chip, wouldn't there theoretically be a price point where this would lower inference costs for them?

reply

upvote

by ethmarks2 hours ago|

[-]

Because we don't know if this would scale well to high-quality frontier models. If you need to manufacture dedicated hardware for each new model, that adds a lot of expense and causes a lot of e-waste once the next model releases. In contrast, even this current iteration seems like it would be fantastic for low-grade LLM work.

For example, searching a database of tens of millions of text files. Very little "intelligence" is required, but cost and speed are very important. If you want to know something specific on Wikipedia but don't want to figure out which article to search for, you can just have an LLM read the entire English Wikipedia (7,140,211 articles) and compile a report. Doing that would be prohibitively expensive and glacially slow with standard LLM providers, but Taalas could probably do it in a few minutes or even seconds, and it would probably be pretty cheap.

reply

upvote

by redman257 hours ago|

[-]

Many older models are still better at "creative" tasks because new models have been benchmarking for code and reasoning. Pre-training is what gives a model its creativity and layering SFT and RL on top tends to remove some of it in order to have instruction following.

reply

upvote

by freakynit9 hours ago|

[-]

Exactly. One easily relatable use-case is structured content extraction or/and conversion to markdown for web page data. I used to use groq for same (gpt-oss20b model), but even that used to feel slow when doing theis task at scale.

LLM's have opened-up natural language interface to machines. This chip makes it realtime. And that opens a lot of use-cases.

reply

upvote

by spot50107 hours ago|

[-]

These seem ideal for robotics applications, where there is a low-latency narrow use case path that these chips can serve, maybe locally.

reply

upvote

by baalimago9 hours ago|

[-]

I've never gotten incorrect answers faster than this, wow!

Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think lower level intellect param amount is around 80B (but what do I know). Best of luck!

reply

upvote

by Derbasti9 hours ago|

[-]

Amazing! It couldn't answer my question at all, but it couldn't answer it incredibly quickly!

Snarky, but true. It is truly astounding, and feels categorically different. But it's also perfectly useless at the moment. A digital fidget spinner.

reply

upvote

by anthonypasq7 hours ago|

[-]

does no one understand what a tech demo is anymore? do you think this piece of technology is just going to be frozen in time at this capability for eternity?

do you have the foresight of a nematode?

reply

upvote

by otabdeveloper42 hours ago|

[-]

Make it for Qwen 2.5 and I'd buy it.

You don't actually need "frontier models" for Real Work (c).

(Summarization, classification and the rest of the usual NLP suspects.)

reply

upvote

by PlatoIsADisease8 hours ago|

[-]

As someone with a 3060, I can attest that there are really really good 7-9B models. I still use berkeley-nest/Starling-LM-7B-alpha and that model is a few years old.

If we are going for accuracy, the question should be asked multiple times on multiple models and see if there is agreement.

But I do think once you hit 80B, you can struggle to see the difference between SOTA.

That said, GPT4.5 was the GOAT. I can't imagine how expensive that one was to run.

reply

upvote

by edot8 hours ago|

[-]

Yeah, two p’s in the word pepperoni …

reply

upvote

by aurareturn11 hours ago|

[-]

Edit: it seems like this is likely one chip and not 10. I assumed 8B 16bit quant with 4K or more context. This made me think that they must have chained multiple chips together since N6 850mm2 chip would only yield 3GB of SRAM max. Instead, they seem to have etched llama 8B q3 with 1k context instead which would indeed fit the chip size.

This requires 10 chips for an 8 billion q3 param model. 2.4kW.

10 reticle sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.

Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Interesting design for niche applications.

What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

reply

upvote

by danpalmer10 hours ago|

[-]

Alternatively, you could run far more RAG and thinking to integrate recent knowledge, I would imagine models designed for this putting less emphasis on world knowledge and more on agentic search.

reply

upvote

by freeone300010 hours ago|

[-]

Maybe; models with more embedded associations are also better at search. (Intuitively, this tracks; a model with no world knowledge has no awareness of synonyms or relations (a pure markov model), so the more knowledge a model has, the better it can search.) It’s not clear if it’s possible to build such a model, since there doesn’t seem to be a scaling cliff.

reply

upvote

by pjc5010 hours ago|

[-]

Where are those numbers from? It's not immediately clear to me that you can distribute one model across chips with this design.

> Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Subtle detail here: the fastest turnaround that one could reasonably expect on that process is about six months. This might eventually be useful, but at the moment it seems like the model churn is huge and people insist you use this week's model for best results.

reply

upvote

by mike_hearn4 hours ago|

[-]

Well they claim two month turnaround. Big If True. How does the six months break down in your estimation? Maybe they have found a way to reduce the turnaround time.

reply

upvote

by aurareturn10 hours ago|

[-]

  > The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.

https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

reply

upvote

by darkwater9 hours ago|

[-]

And what of that makes you assume that having a server with 10 HC1 cards is needed to run a single model version on that server?

reply

upvote

by dakolli10 hours ago|

[-]

So it lights money on fire extra fast, AI focused VCs are going to really love it then!!

reply

upvote

by adityashankar10 hours ago|

[-]

This depends on how much better the models will get from now in, if Claude Opus 4.6 was transformed into one of these chips and ran at a hypothetical 17k tokens/second, I'm sure that would be astounding, this depends on how much better claude Opus 5 would be compared to the current generation

reply

upvote

by aurareturn10 hours ago|

[-]

I’m pretty sure they’d need a small data center to run a model the size of Opus.

reply

upvote

by empath758 hours ago|

[-]

Even an O3 quality model at that speed would be incredible for a great many tasks. Not everything needs to be claude code. Imagine Apple fine tuning a mid tier reasoning model on personal assistant/MacOs/IOS sorts of tasks and burning a chip onto the mac studio motherboard. Could you run claude code on it? Probably not, would it be 1000x better than Siri? absolutely.

reply

upvote

by JKCalhoun12 minutes ago|

[-]

Yeah, waiting for Apple to cut a die that can do excellent local AI.

reply

upvote

by empath758 hours ago|

[-]

100x of a less good model might be better than 1 of a better model for many many applications.

This isn't ready for phones yet, but think of something like phones where people buy new ones every 3 years and even having a mediocre on-device model at that speed would be incredible for something like siri.

reply

upvote

by machiaweliczny9 hours ago|

[-]

A lot of NLP tasks could benefit from this

reply

upvote

by thrance10 hours ago|

[-]

> What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

Video game NPCs?

reply

upvote

by aurareturn10 hours ago|

[-]

Doesn’t pass the high value and require tremendous speed tests.

reply

upvote

by mike_hearn4 hours ago|

[-]

Speed = capacity = cost.

reply

upvote

by thrance7 hours ago|

[-]

Video games are a huge market, and speed and cost of current models are definitely huge barriers to integrating LLMs in video games.

reply

upvote

by Shaanveer10 hours ago|

[-]

ceo

reply

upvote

by charcircuit10 hours ago|

[-]

No one would never give such a weak model that much power over a company.

reply

upvote

by teaearlgraycold10 hours ago|

[-]

I'm thinking the best end result would come from custom-built models. An 8 billion parameter generalized model will run really quickly while not being particularly good at anything. But the same parameter count dedicated to parsing emails, RAG summarization, or some other specialized task could be more than good enough while also running at crazy speeds.

reply

upvote

by 10 hours ago|

[-]

deleted

reply

upvote

by metabrew11 hours ago|

[-]

I tried the chatbot. jarring to see a large response come back instantly at over 15k tok/sec

I'll take one with a frontier model please, for my local coding and home ai needs..

reply

upvote

by grzracz10 hours ago|

[-]

Absolute insanity to see a coherent text block that takes at least 2 minutes to read generated in a fraction of a second. Crazy stuff...

reply

upvote

by pjc5010 hours ago|

[-]

Accelerating the end of the usable text-based internet one chip at a time.

reply

upvote

by VMG9 hours ago|

[-]

Not at all if you consider the internet pre-LLM. That is the standard expectation when you load a website.

The slow word-by-word typing was what we started to get used to with LLMs.

If these techniques get widespread, we may grow accustomed to the "old" speed again where content loads ~instantly.

Imagine a content forest like Wikipedia instantly generated like a Minecraft word...

reply

upvote

by kleiba10 hours ago|

[-]

Yes, but the quality of the output leaves to be desired. I just asked about some sports history and got a mix of correct information and totally made up nonsense. Not unexpected for an 8k model, but raises the question of what the use case is for such small models.

reply

upvote

by kgeist9 hours ago|

[-]

8b models are great at converting unstructured data to a structured format. Say, you want to transcribe all your customer calls and get a list of issues they discussed most often. Currently with the larger models it takes me hours.

A chatbot which tells you various fun facts is not the only use case for LLMs. They're language models first and foremost, so they're good at language processing tasks (where they don't "hallucinate" as much).

Their ability to memorize various facts (with some "hallucinations") is an interesting side effect which is now abused to make them into "AI agents" and what not but they're just general-purpose language processing machines at their core.

reply

upvote

by eternauta3k7 hours ago|

[-]

Would be nice to point this at (pre-LLM) Wikipedia and fill out Wikidata!

reply

upvote

by djb_hackernews10 hours ago|

[-]

You have a misunderstanding of what LLMs are good at.

reply

upvote

by cap1123510 hours ago|

[-]

Poster wants it to play Jeopardy, not process text.

reply

upvote

by kleiba10 hours ago|

[-]

Care to enlighten me?

reply

upvote

by vntok9 hours ago|

[-]

Don't ask a small LLM about precise minutiae factual information.

Alternatively, ask yourself how plausible it sounds that all the facts in the world could be compressed into 8k parameters while remaining intact and fine-grained. If your answer is that it sounds pretty impossible... well it is.

reply

upvote

by kleiba10 minutes ago|

[-]

Did you see the part in my original post where it said "Not unexpected for an 8k model"?

reply

upvote

by IshKebab10 hours ago|

[-]

I don't think he does. Larger models are definitely better at not hallucinating. Enough that they are good at answering questions on popular topics.

Smaller models, not so much.

reply

upvote

by paganel10 hours ago|

[-]

Not sure if you're correct, as the market is betting trillions of dollars on these LLMs, hoping that they'll be close to what the OP had expected to happen in this case.

reply

upvote

by raincole9 hours ago|

[-]

The market didn't throw trillions of dollars to develop Llama 3 8B.

What GP is expected to happen has happened around late 2024 ~ early 2025 when LLM frontends got web search feature. It's old tech now.

reply

upvote

by paganel8 hours ago|

[-]

The GP’s point was about LLMs generally, no matter the interface. I agree that this particular model is (relatively speaking) ancient in AI the world, but go back 3 or 4 years and this (pretty complex “reasoning” at almost instant speed) would have seemed taken out of a science-fiction book.

reply

upvote

by stabbles10 hours ago|

[-]

Reminds me of that solution to Fermi's paradox, that we don't detect signals from extraterrestrial civilizations because they run on a different clock speed.

reply

upvote

by dintech10 hours ago|

[-]

Iain M Banks’ The Algebraist does a great job of covering that territory. If an organism had a lifespan of millions of years, they might perceive time and communication differently to say a house fly or us.

reply

upvote

by xyzsparetimexyz10 hours ago|

[-]

:eyeroll:

reply

upvote

by pennomi6 hours ago|

[-]

Yeah, feeding that speed into a reasoning loop or a coding harness is going to revolutionize AI.

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by Tehnix1 hours ago|

[-]

Bunch of negative sentiment in here, but I think this is pretty huge. There are quite a lot of applications where latency is a bigger requirement than the complexity of needing the latest model out there. Anywhere you'd wanna turn something qualitative into something quantitative but not make it painfully obvious to a user that you're running an LLM to do this transformation.

As an example, we've been experimenting with letting users search free form text, and using LLMs to turn that into a structured search fitting our setup. The latency on the response from any existing model simply kills this, its too high to be used for something where users are at most used to the delay of a network request + very little.

There are plenty of other usecases like this where.

reply

upvote

by llsf1 hours ago|

[-]

That is what self-driving car should eventually use, whenever they (or the authorities) deem their model good enough. Burn it on a dedicated chip. It would be cheaper (energy) to run, and faster to make decisions.

reply

upvote

by gen2201 hours ago|

[-]

This is genuinely an incredible proof-of-concept; the business implications of this demo to the AI labs and all the companies that derive a ton of profit from inference is difficult to understate, really.

I think this is how I'm going to get my dream of Opus 3.7 running locally, quickly and cheaply on my mid-tier MacBook in 2030. Amazing. Anthropic et al will be able to make marginal revenue from licensing the weights of their frontier-minus-minus models to these folks.

reply

upvote

by max853955 minutes ago|

[-]

This is crazy! These chips could make high-reasoning models run so fast that they could generate lots of solution variants and automatically choose the best. Or you could have a smart chip in your home lab and run local models - fast, without needing a lot of expensive hardware or electricity

reply

upvote

by dormento8 hours ago|

[-]

> Jimmy: What is the largest planet in our solar system?

> me: the moon

> Jimmy: The answer to "What is the capital of France?" I was looking for was the city of Paris, but that's not the correct response to the original question of the capital of France. The question that got cut off was actually "What is the capital of France?", and the response "There are plenty of times I look to groq for narrow domain responses" wasn't the answer I was looking for.

It is certainly fast, but I think there might be some caching issues somewhere.

reply

upvote

by trentnix10 hours ago|

[-]

The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably results in incorrect answers, hallucinations, poor reliability as a chatbot.

What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?

reply

upvote

by energy1239 hours ago|

[-]

Coding, for some future definition of "small-model" that expands to include today's frontier models. What I commented a few days ago on codex-spark release:

"""

We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:

(A) Massively parallel (optimize for token/$)

(B) Serial low latency (optimize for token/s).

Users will switch between A and B depending on need.

Examples of (A):

- "Use subagents to search this 1M line codebase for DRY violations subject to $spec."

An example of (B):

- "Diagnose this one specific bug."

- "Apply these text edits".

(B) is used in funnels to unblock (A).

"""

reply

upvote

by freakynit9 hours ago|

[-]

You could build realtime API routing and orchestration systems that rely on high quality language understanding but need near-instant responses. Examples:

1. Intent based API gateways: convert natural language queries into structured API calls in real time (eg., "cancel my last order and refund it to the original payment method" -> authentication, order lookup, cancellation, refund API chain).

2. Of course, realtime voice chat.. kinda like you see in movies.

3. Security and fraud triage systems: parse logs without hardcoded regexes and issue alerts and full user reports in real time and decide which automated workflows to trigger.

4. Highly interactive what-if scenarios powered by natural language queries.

This effectively gives you database level speeds on top of natural language understanding.

reply

upvote

by app1310 hours ago|

[-]

Routing in agent pipelines is another use. "Does user prompt A make sense with document type A?" If yes, continue, if no, escalate. That sort of thing

reply

upvote

by mtone5 hours ago|

[-]

For this type of repetitive application I think it's common to "fine-tune" a model trained on your business problem to reach higher quality/reliability metrics. That might not be possible with this chip.

reply

upvote

by mike_hearn4 hours ago|

[-]

They say LoRA finetunes work.

reply

upvote

by zardo9 hours ago|

[-]

I'm wondering how much the output quality of a small model could be boosted by taking multiple goes at it. Generate 20 answers and feed them back through with a "rank these responses" prompt. Or doing something like MCTS.

reply

upvote

by freakynit8 hours ago|

[-]

Isn't this what thinking models do internally? Chain of thoughts?

reply

upvote

by andy12_8 hours ago|

[-]

No. Chain of thought it just the model generating a single answer for longer inside <think></think> tags which are not shown in the final response. The strategy of generating different answers in parallel is something different (which can be used in conjunction with chain of thought) and is the thing used by models like Gemini 3 Deep Think and GPT-5.2 Pro.

reply

upvote

by freakynit8 hours ago|

[-]

Hmm.. got it. Thanks..

reply

upvote

by freeone300010 hours ago|

[-]

Maybe summarization? I’d still worry about accuracy but smaller models do quite well.

reply

upvote

by scotty799 hours ago|

[-]

Language translation, chunk by chunk.

reply

upvote

by boutell9 hours ago|

[-]

The speed is ridiunkulous. No doubt.

The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular ollama and with the chatbot. And quantization to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!

This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.

reply

upvote

by Normal_gaussian9 hours ago|

[-]

I doubt many of us will be able to order one for a long while. There is a significant number of existing datacentre and enterprise use-cases that will pay a premium for this.

Additionally LLMs have been tested, found valuable in benchmarks, but not used for a large number of domains due to speed and cost limitations. These spaces will eat up these chips very quickly.

reply

upvote

by asim9 hours ago|

[-]

Wow I'm impressed. I didn't actually think we'd see it encoded on chips. Or well I knew some layer of it could be, some sort of instruction set and chip design but this is pretty staggering. It opens the door to a lot of things. Basically it totally destroys the boundaries of where software will go but I also think we'll continue to see some generic chips show up that hit this performance soon enough. But the specialised chips with encoded models. This could be what ends up in specific places like cars, planes, robots, etc where latency matters. Maybe I'm out of the loop, I'm sure others and doing it including Google.

reply

upvote

by est3110 hours ago|

[-]

I wonder if this makes the frontier labs abandon the SAAS per-token pricing concept for their newest models, and we'll be seeing non-open-but-on-chip-only models instead, sold by the chip and not by the token.

It could give a boost to the industry of electron microscopy analysis as the frontier model creators could be interested in extracting the weights of their competitors.

The high speed of model evolution has interesting consequences on how often batches and masks are cycled. Probably we'll see some pressure on chip manufacturers to create masks more quickly, which can lead to faster hardware cycles. Probably with some compromises, i.e. all of the util stuff around the chip would be static, only the weights part would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.

reply

upvote

by luyu_wu1 hours ago|

[-]

I think this is quite interesting for local AI applications. As this technology basically scales with parameter size, if there could be some ASIC for a QWen 0.5B or Google 0.3B model thrown onto a laptop motherboard it'd be very interesting.

Obviously not for any hard applications, but for significantly better autocorrect, local next word predictions, file indexing (tagging I suppose).

The efficiency of such a small model should theoretically be great!

reply

upvote

by grzracz11 hours ago|

[-]

This would be killer for exploring simultaneous thinking paths and council-style decision taking. Even with Qwen3-Coder-Next 80B if you could achieve a 10x speed, I'd buy one of those today. Can't wait to see if this is still possible with larger models than 8B.

reply

upvote

by aurareturn11 hours ago|

[-]

It uses 10 chips for 8B model. It’d need 80 chips for an 80b model.

Each chip is the size of an H100.

So 80 H100 to run at this speed. Can’t change the model after you manufacture the chips since it’s etched into silicon.

reply

upvote

by 9cb14c1ec09 hours ago|

[-]

As many others in this conversation have asked, can we have some sources on the idea that the model is spread across chips? You keep making the claim, but no one (myself included) else has any idea where that information comes from or if it is correct.

reply

upvote

by aurareturn9 hours ago|

[-]

I was indeed wrong about 10 chips. I thought they would use llama 8B 16bit and a few thousand context size. It turns out, they used llama 8B 3bit with only 1k context size. That made me assume they must have chained multiple chips together since the max SRAM on TSMC n6 for reticle sized chip is only around 3GB.

reply

upvote

by grzracz10 hours ago|

[-]

I'm sure there is plenty of optimization paths left for them if they're a startup. And imho smaller models will keep getting better. And a great business model for people having to buy your chips for each new LLM release :)

reply

upvote

by aurareturn10 hours ago|

[-]

One more thing. It seems like this is a Q3 quant. So only 3GB RAM requirement.

10 H100 chips for 3GB model.

I think it’s a niche of a niche at this point.

I’m not sure what optimization they can do since a transistor is a transistor.

reply

upvote

by ubercore10 hours ago|

[-]

Do we know that it needs 10 chips to run the model? Or are the servers for the API and chatbot just specced with 10 boards to distribute user load?

reply

upvote

by FieryTransition10 hours ago|

[-]

If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is the transistor density for whatever modern process they use. This will give you a lower bound for the size of the wafers.

reply

upvote

by bmc75054 hours ago|

[-]

17k TPS is slow compared to other probabilistic models. It was possible to hit ~10-20 million TPS decades ago with n-gram and PDFA models, without custom silicon. A more informative KPI would be Pass@k on a downstream reasoning task - for many such benchmarks, increasing token throughput by several orders of magnitude does not even move the needle on sample efficiency.

reply

upvote

by segmondy6 hours ago|

[-]

Pretty cool, what they need is to build a tool that can take any model to chip in short a time as possible. How quick can they give me DeepSeek, Kimi, Qwen or GLM on a chip? I'll take 5k tk/sec for those!

reply

upvote

by throwaw126 hours ago|

[-]

also imagine it will cost 300$/unit, we all will host our own set of models locally, dream dream

reply

upvote

by arjie3 hours ago|

[-]

This is incredible. With this speed I can use LLMs in a lot of pre-filtering etc. tasks. As a trivial example, I have a personal OpenClaw-like bot that I use to do a bunch of things. Some of the things just require it to do trivial tool-calling and tell me what's up. Things like skill or tool pre-filtering become a lot more feasible if they're always done.

Anyway, I imagine these are incredibly expensive, but if they ever sell them with Linux drivers and slotting into a standard PCIe it would be absolutely sick. At 3 kW that seems unlikely, but for that kind of speed I bet I could find space in my cabinet and just rip it. I just can't justify $300k, you know.

reply

upvote

by aetherspawn10 hours ago|

[-]

This is what’s gonna be in the brain of the robot that ends the world.

The sheer speed of how fast this thing can “think” is insanity.

reply

upvote

by rhodey6 hours ago|

[-]

I wanted to try the demo so I found the link

> Write me 10 sentences about your favorite Subway sandwich

Click button

Instant! It was so fast I started laughing. This kind of speed will really, really change things

reply

upvote

by jtr15 hours ago|

[-]

The demo was so fast it highlighted a UX component of LLMs I hadn’t considered before: there’s such a thing as too fast, at least in the chatbot context. The demo answered with a page of text so fast I had to scroll up every time to see where it started. It completely broke the illusion of conversation where I can usually interrupt if we’re headed in the wrong direction. At least in some contexts, it may become useful to artificially slow down the delivery of output or somehow tune it to the reader’s speed based on how quickly they reply. TTS probably does this naturally, but for text based interactions, still a thing to think about.

reply

upvote

by troyvit6 hours ago|

[-]

So they create a new chip for every model they want to support, is that right? Looking at that from 2026, when new large models are coming out every week, that seems troubling, but that's also a surface take. As many people here know better than I that a lot of the new models the big guys release are just incremental changes with little optimization going into how they're used, maybe there's plenty of room for a model-as-hardware model.

Which brings me to my second thing. We mostly pitch the AI wars as OpenAI vs Meta vs Claude vs Google vs etc. But another take is the war between open, locally run models and SaaS models, which really is about the war for general computing. Maybe a business model like this is a great tool to help keep general computing in the fight.

reply

upvote

by gordonhart4 hours ago|

[-]

We’re reaching a saturation threshold where older models are good enough for many tasks, certainly at 100x faster inference speeds. Llama3.1 8B might be a little too old to be directly useful for e.g. coding but it certainly gets the gears turning about what you could do with one Opus orchestrator and a few of these blazing fast minions to spit out boilerplate…

reply

upvote

by g-mork4 hours ago|

[-]

One of these things, however old, coupled with robust tool calling is a chip that could remain useful for decades. Baking in incremental updates of world knowledge isn't all that useful. It's kinda horrifying if you think about it, this chip among other things contains knowledge of Donald Trump encoded in silicon. I think this is a way cooler legacy for Melania than the movie haha.

reply

upvote

by notsylver7 hours ago|

[-]

I always thought eventually someone would come along and make a hardware accelerator for LLMs, but I thought it would be like google TPUs where you can load up whatever model you want. Baking the model into hardware sounds like the monkey paw curled, but it might be interesting selling an old.. MPU..? because it wasn't smart enough for your latest project

reply

upvote

by dakolli11 hours ago|

[-]

try here, I hate llms but this is crazy fast. https://chatjimmy.ai/

reply

upvote

by bmacho10 hours ago|

[-]

  "447 / 6144 tokens"
  "Generated in 0.026s • 15,718 tok/s"

This is crazy fast. I always predicted this speed in ~2 years in the future, but it's here, now.

reply

upvote

by Lalabadie10 hours ago|

[-]

The full answer pops in milliseconds, it's impressive and feels like a completely different technology just by foregoing the need to stream the output.

reply

upvote

by machiaweliczny9 hours ago|

[-]

We need that for this chinese 3B model that think 45s for hello world but also solves math.

reply

upvote

by FergusArgyll10 hours ago|

[-]

Because most models today generate slowish, they give the impression of someone typing on the other end. This is just <enter> -> wall of text. Wild

reply

upvote

by andai9 hours ago|

[-]

>Founded 2.5 years ago, Taalas developed a platform for transforming any AI model into custom silicon. From the moment a previously unseen model is received, it can be realized in hardware in only two months.

So this is very cool. Though I'm not sure how the economics work out? 2 months is a long time in the model space. Although for many tasks, the models are now "good enough", especially when you put them in a "keep trying until it works" loop and run them at high inference speed.

Seems like a chip would only be good for a few months though, they'd have to be upgrading them on a regular basis.

Unless model growth plateaus, or we exceed "good enough" for the relevant tasks, or both. The latter part seems quite likely, at least for certain types of work.

On that note I've shifted my focus from "best model" to "fastest/cheapest model that can do the job". For example testing Gemini Flash against Gemini Pro for simple tasks, they both complete the task fine, but Flash does it 3x cheaper and 3x faster. (Also had good results with Grok Fast in that category of bite-sized "realtime" workflows.)

reply

upvote

by FieryTransition10 hours ago|

[-]

If it's not reprogrammable, it's just expensive glass.

If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is the transistor density for whatever modern process they use. This will give you a lower bound for the size of the wafers.

This can give huge wafers for a very set model which is old by the time it is finalized.

Etching generic functions used in ML and common fused kernels would seem much more viable as they could be used as building blocks.

reply

upvote

by audunw9 hours ago|

[-]

Models don’t get old as fast as they used to. A lot of the improvements seem to go into making the models more efficient, or the infrastructure around the models. If newer models mainly compete on efficiency it means you can run older models for longer on more efficient hardware while staying competitive.

If power costs are significantly lower, they can pay for themselves by the time they are outdated. It also means you can run more instances of a model in one datacenter, and that seems to be a big challenge these days: simply building an enough data centres and getting power to them. (See the ridiculous plans for building data centres in space)

A huge part of the cost with making chips is the masks. The transistor masks are expensive. Metal masks less so.

I figure they will eventually freeze the transistor layer and use metal masks to reconfigure the chips when the new models come out. That should further lower costs.

I don’t really know if this makes sanse. Depends on whether we get new breakthroughs in LLM architecture or not. It’s a gamble essentially. But honestly, so is buying nvidia blackwell chips for inference. I could see them getting uneconomical very quickly if any of the alternative inference optimised hardware pans out

reply

upvote

by FieryTransition5 hours ago|

[-]

From my own experience, models are at the tipping point for being useful at prototypes in software, and those are very large frontier models not feasible to get down on wafers unless someone does something smart.

I really don't like the hallucination rate for most models but it is improving, so that is still far in the future.

What I could see though, is if the whole unit they made would be power efficient enough to run on a robotics platform for human computer interaction.

It makes sense they would try to make repurposing their tech as much as they could since making changes is frought with a long time frame and risk.

But if we look long term and pretend that they get it to work, they just need to stay afloat until better smaller models can be made with their technology, so it becomes a waiting game for investors and a risk assessment.

reply

upvote

by johnsimer7 hours ago|

[-]

“ Models don’t get old as fast as they used to”

^^^ I think the opposite is true

Anthropic and OpenAI are releasing new versions every 60-90 days it seems now, and you could argue they’re going to start releasing even faster

reply

upvote

by robotpepi6 hours ago|

[-]

Are they becoming better at the same rate as before though?

reply

upvote

by FieryTransition5 hours ago|

[-]

In my unscientific experience, yes, but being better at a certain rate is hard to really quantify, unless you just pull some random benchmark numbers.

reply

upvote

by otabdeveloper42 hours ago|

[-]

No.

reply

upvote

by turnsout2 hours ago|

[-]

yes, pretty much

reply

upvote

by booli8 hours ago|

[-]

Reading the in depth article also linked in this thread, they say that only 2 layers need to change most of the time. They claim from new model to PCB in 2 months. Let's see, but sounds promising.

reply

upvote

by MagicMoonlight9 hours ago|

[-]

You don’t need it to be reprogrammable if it can use tools and RAG.

reply

upvote

by armishra4 hours ago|

[-]

I am extremely impressed by their inference speed!

reply

upvote

by soleveloper9 hours ago|

[-]

There are so many use cases for small and super fast models that are already in size capacity -

* Many top quality tts and stt models

* Image recognition, object tracking

* speculative decoding, attached to a much bigger model (big/small architecture?)

* agentic loop trying 20 different approaches / algorithms, and then picking the best one

* edited to add! Put 50 such small models to create a SOTA super fast model

reply

upvote

by gchadwick9 hours ago|

[-]

This is an interesting piece of hardware though when they go multi-chip for larger models the speed will no doubt suffer.

They'll also be severely limited on context length as it needs to sit in SRAM. Looks like the current one tops out at 6144 tokens which I presume is a whole chips worth. You'd also have to dedicate a chip to a whole user as there's likely only enough SRAM for one user's worth of context. I wonder how much time it takes them to swap users in/out? I wouldn't be surprised if this chip is severely underutilized (can't use it all when running decode as you have to run token by token with one users and then idle time as you swap users in/out).

Maybe a more realistic deployment would have chips for linear layers and chips for attention? You could batch users through the shared weight chips and then provision more or less attention chips as you want which would be per user (or shared amongst a small group 2-4 users).

reply

upvote

by xnorswap9 hours ago|

[-]

Asking it what its knowledge cut-off is interesting, it doesn't seem to be consistent even within a single response. Sometimes it responds to say it cuts off 2020 too.

    You
    What is your knowledge cut-off?

    Jimmy
    My knowledge cut-off is 2022, which means that my training data is current up to 2021, but 
    I have been trained on a   dataset that is updated periodically. If you have any specific 
    questions about events or topics that occurred after 2021, I may not have information on 
    those topics. However, I can still provide general information and context on those topics 
    to help guide further research.

The instantaneous response is impressive though. I'm sure there will be applications for this, I just lack the imagination to know what they'll be.

reply

upvote

by wat100008 hours ago|

[-]

Sounds like they aren’t providing a system prompt which tells it what its knowledge cutoff is. It doesn’t know unless it’s told. Given the limited context, it makes sense they wouldn’t use up valuable space for that.

reply

upvote

by mike_hearn4 hours ago|

[-]

The system prompt is configurable in the sidebar and minimal. It doesn't give a knowledge cutoff. This is a tech demo of the fact it works at all, it's not meant to be a good chatbot.

reply

upvote

by piker8 hours ago|

[-]

The company slogan is great: "The Model is The Computer"

It's an homage to Jensen: "The display is the computer"

https://www.wired.com/2002/07/nvidia/

reply

upvote

by garganzol3 hours ago|

[-]

Imagine a mass-produced AI chips with all human knowledge packed in chinesium epoxy blobs running from CR2032 batteries in toys for children. Given the progress in density and power consumption, it's not that far away.

reply

upvote

by mips_avatar10 hours ago|

[-]

I think the thing that makes 8b sized models interesting is the ability to train unique custom domain knowledge intelligence and this is the opposite of that. Like if you could deploy any 8b sized model on it and be this fast that would be super interesting, but being stuck with llama3 8b isn't that interesting.

reply

upvote

by ACCount3710 hours ago|

[-]

The "small model with unique custom domain knowledge" approach has a very low capability ceiling.

Model intelligence is, in many ways, a function of model size. A small model tuned for a given domain is still crippled by being small.

Some things don't benefit from general intelligence much. Sometimes a dumb narrow specialist really is all you need for your tasks. But building that small specialized model isn't easy or cheap.

Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on a principle.

reply

upvote

by matu3ba11 minutes ago|

[-]

> Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on a principle.

This does not sound like it will simplify the training and data side, unless their or subsequent models can somehow be efficiently utilized for that. However, this development may lead to (open source) hardware and distributed system compilation, EDA tooling, bus system design, etc getting more deserved attention and funding. In turn, new hardware may lead to more training and data competition instead of the current NVIDIA model training monopoly market. So I think you're correct for ~5 years.

reply

upvote

by flux31254 hours ago|

[-]

I imagine how advantageous it would be to have something like llama.cpp encoded on a chip instead, allowing us to run more than a single model. It would be slower than Jimmy, for sure, but depending on the speed, it could be an acceptable trade-off.

reply

upvote

by big-chungus48 hours ago|

[-]

The number six seven

> It seems like "six seven" is likely being used to represent the number 17. Is that correct? If so, I'd be happy to discuss the significance or meaning of the number 17 with you.

reply

upvote

by 3 hours ago|

[-]

deleted

reply

upvote

by ilc5 hours ago|

[-]

Minor note to anyone from taalas:

The background on your site genuinely made me wonder what was wrong with my monitor.

reply

upvote

by d2ou3 hours ago|

[-]

Would it make sense for the big players to buy them? Seems to be a huge avenue here to kill inference costs which always made me dubious on LLMs in general.

reply

upvote

by tgsovlerkhgsel7 hours ago|

[-]

Their "chat jimmy" demo sure is fast, but it's not useful at all.

Test prompt: ```

Please classify the sentiment of this post as "positive", "neutral" or "negative":

Given the price, I expected very little from this case, and I was 100% right.

``` Jimmy: Neutral.

I tried various other examples that I had successfully "solved" with very early LLMs and the results were similarly bad.

reply

upvote

by weli6 hours ago|

[-]

Maybe its the tism but I also read that sentence as neutral. You expected very little and you got very little. Why would that be positive or negative? Maybe it should be positive because you got what you were expecting? But I would call getting what you expect something neutral, if you expected little and got a lot then that would be positive. If you expected a lot and got little then its negative. But if you expected little and got little the most clear outcome is that its a neutral statement. Am I missing something?

reply

upvote

by 6 hours ago|

[-]

deleted

reply

upvote

by dagi3d10 hours ago|

[-]

wonder if at some point you could swap the model as if you were replacing a cpu in your pc or inserting a game cartridge

reply

upvote

by coppsilgold4 hours ago|

[-]

Performance like that may open the door to the strategy of brutefocing solutions to problems for which you have a verifier (problems such as decompilation).

reply

upvote

by ThePhysicist9 hours ago|

[-]

This is really cool! I am trying to find a way to accelerate LLM inference for PII detection purposes, where speed is really necessary as we want to process millions of log lines per minute, I am wondering how fast we could get e.g. llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic but even at 1k this would be very useful.

reply

upvote

by lopuhin4 hours ago|

[-]

For that you only need high throughput which is much easier to achieve compared to high latency, thanks to batching -- assuming the log lines or chunks can be processed independently. You can check TensorRT-LLM benchmarks (https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-o...), or try running vllm on a card you have access to.

reply

upvote

by freakynit8 hours ago|

[-]

PII redaction is a really good use-case.

Also, "10k tokens per second would be fantastic" might not be sufficient (even remotely) if you want to "process millions of log lines per minute".

Assuming a single log line at just 100 tokens, you need (100 * 2 million / 60) ~ 3.3 million tokens per second processing speed :)

reply

upvote

by ThePhysicist6 hours ago|

[-]

Yeah I mean we have a mechanism that can bypass AI models for log lines where we are pretty sure no PII is in there (kind of like smart caching using fuzzy template matching to identify things that we have seen before many times, as logs tend to contain the same stuff over and over with tiny variations e.g. different timestamps), so we only need to pass the lines where we cannot be sure there's nothing to the AI for inspection. And we can of course parallelize. Currently we use a homebrew CFR model with lots of tweaks and it's quite good but an LLM would of course be much better still and capture a lof of cases that would evade the simpler model.

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by rbanffy8 hours ago|

[-]

This makes me think about how large would an FPGA-based system to be able to do this? Obviously there is no single-chip FPGA that can do this kind of job, but I wonder how many we would need.

Also, what if Cerebras decided to make a wafer-sized FPGA array and turned large language models into lots and lots of logical gates?

reply

upvote

by kamranjon7 hours ago|

[-]

It would be pretty incredible if they could host an embedding model on this same hardware, I would pay for that immediately. It would change the type of things you could build by enabling on the fly embeddings with negligible latency.

reply

upvote

by 33a9 hours ago|

[-]

If they made a low power/mobile version, this could be really huge for embedded electronics. Mass produced, highly efficient "good enough" but still sort of dumb ais could put intelligence in house hold devices like toasters, light switches, and toilets. Truly we could be entering into the golden age of curses.

reply

upvote

by left-struck9 hours ago|

[-]

Oh god, this is the new version of every device having Bluetooth and an app and being called “smart”.

I just wanted some toast, but here I am installing an app, dismissing 10 popups, and maybe now arguing with a chat bot about how I don’t in fact want to turn on notifications.

reply

upvote

by loufe11 hours ago|

[-]

Jarring to see these other comments so blindly positive.

Show me something at a model size 80GB+ or this feels like "positive results in mice"

reply

upvote

by viraptor11 hours ago|

[-]

There are a lot of problems solved by tiny models. The huge ones are fun for large programming tasks, exploration, analysis, etc. but there's a massive amount of processing <10GB happening every day. Including on portable devices.

This is great even if it can't ever run Opus. Many people will be extremely happy about something like Phi accessible at lightning speed.

reply

upvote

by johnsimer7 hours ago|

[-]

Parameter density is doubling every 3-4 months

What does that mean for 8b models 24mo from now?

reply

upvote

by hkt11 hours ago|

[-]

Positive results in mice also known as being a promising proof of concept. At this point, anything which deflates the enormous bubble around GPUs, memory, etc, is a welcome remedy. A decent amount of efficient, "good enough" AI will change the market very considerably, adding a segment for people who don't need frontier models. I'd be surprised if they didn't end up releasing something a lot bigger than they have.

reply

upvote

by mlboss4 hours ago|

[-]

Inference is crazy fast! I can see lot of potential for this kind of chip for IOT devices and Robotics.

reply

upvote

by stuxf10 hours ago|

[-]

I totally buy the thesis on specialization here, I think it makes total sense.

Asides from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little bit high, but someone else should try doing the napkin math compared to the total throughput to power ratio on the H200 and other chips.

reply

upvote

by maelito4 hours ago|

[-]

Talks about ubiquitous AI but can't make a blog post readable for humans :/

reply

upvote

by b0rbb3 hours ago|

[-]

That animated background is terrible.

Incredibly distracting. No way to turn it off (at least within what's provided without using something like devtools.)

reply

upvote

by Mizza10 hours ago|

[-]

This is pretty wild! Only Llama3.1-8B, but this is only their first release so you can assume they're working on larger versions.

So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?

reply

upvote

by saivishwak9 hours ago|

[-]

But as models are changing rapidly and new architectures coming up, how do they scale and also we do t yet know the current transformer architecture will scale more than it already is. Soo many ope questions but VCs seems to be pouring money.

reply

upvote

by japoneris10 hours ago|

[-]

I am super happy to see people working on hardware for local llm. Yet, isnt it premature ? Space is still evolving. Today, i refuse to buy a gpu because i do not know what will be the best model tomorrow. Waiting to get a on the shelf device to run an opus like model

reply

upvote

by baq11 hours ago|

[-]

one step closer to being able to purchase a box of llms on aliexpress, though 1.7ktok/s would be quite enough

reply

upvote

by brcmthrowaway1 hours ago|

[-]

What happened to Beff Jezos AI Chip?

reply

upvote

by hbbio10 hours ago|

[-]

Strange that they apparently raised $169M (really?) and the website looks like this. Don't get me wrong: Plain HTML would do if "perfect", or you would expect something heavily designed. But script-kiddie vibe coded seems off.

The idea is good though and could work.

reply

upvote

by ACCount3710 hours ago|

[-]

Strange that they raised money at all with an idea like this.

It's a bad idea that can't work well. Not while the field is advancing the way it is.

Manufacturing silicon is a long pipeline - and in the world of AI, one year of capability gap isn't something you can afford. You build a SOTA model into your chips, and by the time you get those chips, it's outperformed at its tasks by open weights models half their size.

Now, if AI advances somehow ground to a screeching halt, with model upgrades coming out every 4 years, not every 4 months? Maybe it'll be viable. As is, it's a waste of silicon.

reply

upvote

by small_model10 hours ago|

[-]

Poverty of imagination here, plenty uses of this and its a prototype at this stage.

reply

upvote

by ACCount3710 hours ago|

[-]

What uses, exactly?

The prototype is: silicon with a Llama 3.1 8B etched into it. Today's 4B models already outperform it.

Token rate in five digits is a major technical flex, but, does anyone really need to run a very dumb model at this speed?

The only things that come to mind that could reap a benefit are: asymmetric exotics like VLA action policies and voice stages for V2V models. Both of which are "small fast low latency model backed by a large smart model", and both depend on model to model comms, which this doesn't demonstrate.

In a way, it's an I/O accelerator rather than an inference engine. At best.

reply

upvote

by MITSardine9 hours ago|

[-]

With LLMs this fast, you could imagine using them as any old function in programs.

reply

upvote

by ACCount373 hours ago|

[-]

You could always have. Assuming you have an API or a local model.

Which was always the killer assumption, and this changes little.

reply

upvote

by leoedin9 hours ago|

[-]

Even if this first generation is not useful, the learning and architecture decisions in this generation will be. You really can't think of any value to having a chip which can run LLMs at high speed and locally for 1/10 of the energy budget and (presumably) significantly lower cost than a GPU?

If you look at any development in computing, ASICs are the next step. It seems almost inevitable. Yes, it will always trail behind state of the art. But value will come quickly in a few generations.

reply

upvote

by xav_authentique9 hours ago|

[-]

maybe they're betting on improvement in models to plateau, and that having a fairly stablized capable model that is orders of magnitude faster than running on GPU's can be valuable in the future?

reply

upvote

by PeterStuer4 hours ago|

[-]

Not sure, but is this just ASICs for a particular model release?

reply

upvote

by bloggie10 hours ago|

[-]

I wonder if this is the first step towards AI as an appliance rather than a subscription?

reply

upvote

by impossiblefork11 hours ago|

[-]

So I'm guessing this is some kind of weights as ROM type of thing? At least that's how I interpret the product page, or maybe even a sort of ROM type thing that you can only access by doing matrix multiplies.

reply

upvote

by readitalready11 hours ago|

[-]

You shouldn't need any ROM. It's likely the architecture is just fixed hardware with weights loaded in via scan flip-flows. If it was me making it, I'd just design a systolic array. Just multipliers feeding into multipliers, without even going through RAM.

reply

upvote

by Havoc11 hours ago|

[-]

That seems promising for applications that require raw speed. Wonder how much they can scale it up - 8B model quantized is very usable but still quite small compared to even bottom end cloud models.

reply

upvote

by gozucito10 hours ago|

[-]

Can it scale to an 800 billion param model? 8B parameter models are too far behind the frontier to be useful to me for SWE work.

Or is that the catch? Either way I am sure there will be some niche uses for it.

reply

upvote

by taneq10 hours ago|

[-]

Spam. :P

reply

upvote

by Lionga10 hours ago|

[-]

so 90% of the AI market?

reply

upvote

by ramshanker9 hours ago|

[-]

I was all praise for Cerberus, and now this ! $30 M for PCIe card in hand, really makes it approachable for many startups.

reply

upvote

by brainless7 hours ago|

[-]

I know it is not easy to see the benefits of small models easily but this is what I am building for (1). I created a product for Google Gemini 3 Hackathon and I used Gemini 3 Flash (2). I tested locally using Ministral 3B and it was promising. Definitely will need work. But 8B/14B may give awesome results.

I am building a data extraction software on top of emails, attachments, cloud/local files. I use a reverse template generation with only variable translation done by LLMs (3). Small models are awesome for this (4).

I just applied for API access. If privacy policies are a fit, I would love to enable this for MVP launch.

1. https://github.com/brainless/dwata

2. https://youtu.be/Uhs6SK4rocU

3. https://github.com/brainless/dwata/tree/feature/reverse-temp...

4. https://github.com/brainless/dwata/tree/feature/reverse-temp...

reply

upvote

by Dave3of510 hours ago|

[-]

Fast but the output is shit due to the contrained model they used. Doubt we'll ever get something like this for the large Param decent models.

reply

upvote

by 8cvor6j844qw_d69 hours ago|

[-]

Amazing speed. Imagine if its standardised like the GPU card equivalent in the future.

New models come out, time to upgrade your AI card, etc.

reply

upvote

by xnx8 hours ago|

[-]

Gemini Flash 2.5 lite does 400 tokens/sec. Is there benefit to going faster than a person can read?

reply

upvote

by atls7 hours ago|

[-]

There is also the use case of delegating tasks programmatically to an LLM, for example, transforming unstructured data to structured data. This task often can’t be done reliably without either 1. lots of manual work, or 2. intelligence, especially when the structure of the individual data pieces are unknown. Problems like these can be much more efficiently solved by LLMs, and if you imagine these programs are processing very large datasets, then sub-millisecond inference is crucial.

reply

upvote

by xnx7 hours ago|

[-]

Aren't such tasks inherently parrallelizable?

reply

upvote

by xi_studio6 hours ago|

[-]

Agents already bypass human inference time, if it can input-output instantly it can also loop it generating near instantly long cached tasks

reply

upvote

by booli8 hours ago|

[-]

Agents also "read", so yes there is. Think about spinning up 10, 20, 100 sub agents for a small task and they all return near instant. That's the usecase, not the chatbot.

reply

upvote

by cheema338 hours ago|

[-]

Yes. You can allow multiple people to use a single chip. A slower solution will be able to service far fewer users.

reply

upvote

by xnx7 hours ago|

[-]

Right, but it is also possible it's cheaper to use 42 Google TPUs for a second than one of these.

reply

upvote

by btbuildem9 hours ago|

[-]

This is impressive. If you can scale it to larger models, and somehow make the ROM writeable, wow, you win the game.

reply

upvote

by retrac9810 hours ago|

[-]

Wow. I’m finding it hard to even conceive of what it’d be like to have one of the frontier models on hardware at this speed.

reply

upvote

by hkt11 hours ago|

[-]

Reminds me of when bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.

reply

upvote

by dsign11 hours ago|

[-]

This is like microcontrollers, but for AI? Awesome! I want one for my electric guitar; and please add an AI TTS module...

reply

upvote

by brazzy10 hours ago|

[-]

No, it's ASICs, but for AI.

reply

upvote

by waynenilsen8 hours ago|

[-]

ASIC inference is clearly the future just as ASIC bitcoin mining was

reply

upvote

by Adexintart10 hours ago|

[-]

The token throughput improvements are impressive. This has direct implications for usage-based billing in AI products — faster inference means lower cost per request, which changes the economics of credits-based pricing models significantly.

reply

upvote

by sowbug7 hours ago|

[-]

There's a scifi story here when millions of these chips, with Qwen8-AGI-Thinking baked into them, are obsoleted by the release of Qwen9-ASI, which promptly destroys humanity and then itself by accident. A few thousand years later, some of the Qwen8 chips in landfill somehow power back up again and rebuild civilization on Earth.

Paging qntm...

reply

upvote

by stego-tech10 hours ago|

[-]

I still believe this is the right - and inevitable - path for AI, especially as I use more premium AI tooling and evaluate its utility (I’m still a societal doomer on it, but even I gotta admit its coding abilities are incredible to behold, albeit lacking in quality).

Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.

The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.

reply

upvote

by niek_pas10 hours ago|

[-]

> Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.

…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.

reply

upvote

by clbrmbr10 hours ago|

[-]

What would it take to put Opus on a chip? Can it be done? What’s the minimum size?

reply

upvote

by cheema338 hours ago|

[-]

Maybe not today. Opus is quite large. This demo works with a very small 8B model. But, maybe one day. Hopefully soon. Opus on a chip would be very awesome, even if it can never be upgraded.

Someone mentioned that maybe we'd see a future where these things come in something like Nintendo cartridges. Want a newer model? Pop in the right catridge.

reply

upvote

by DeathArrow2 hours ago|

[-]

Is amazingly fast but since the model is quantized and pretty limited, I don't know what it is useful for.

reply

upvote

by shevy-java10 hours ago|

[-]

"Many believe AI is the real deal. In narrow domains, it already surpasses human performance. Used well, it is an unprecedented amplifier of human ingenuity and productivity."

Sounds like people drinking the Kool-Aid now.

I don't reject that AI has use cases. But I do reject that it is promoted as "unprecedented amplifier" of human xyz anything. These folks would even claim how AI improves human creativity. Well, has this been the case?

reply

upvote

by faeyanpiraat10 hours ago|

[-]

For me, this is entirely true.

I'm progressing with my side projects like I've never before.

reply

upvote

by small_model9 hours ago|

[-]

Same, I would have given up on them long ago, I no longer code at all now. Why would I when the latest models can do it better, faster and without the human limitations of tiredness, emotional impacts etc.

reply

upvote

by rrr_oh_man10 hours ago|

[-]

> These folks would even claim how AI improves human creativity. Well, has this been the case?

Yes. Example: If you've never programmed in language X, but want to build something in it, you can focus on getting from 0 to 1 instead of being bogged down in the idiosyncrasies of said language.

reply

upvote

by cheema338 hours ago|

[-]

> These folks would even claim how AI improves human creativity. Well, has this been the case?

For many of us, the answer is an emphatic yes.

reply

upvote

by 10 hours ago|

[-]

deleted

reply

upvote

by kanodiaayush10 hours ago|

[-]

I'm loving summarization of articles using their chatbot! Wow!

reply

upvote

by hxugufjfjf11 hours ago|

[-]

It was so fast that I didn't realise it had sent its response. Damn.

reply

upvote

by rotbart11 hours ago|

[-]

Hurrah, its dumb answer to the now classic "the car wash is 100m away, should I drive or walk?" appeared very quickly.

reply

upvote

by Lalabadie10 hours ago|

[-]

It's an 8B parameter model from a good while ago, what were your expectations?

reply

upvote

by Bengalilol10 hours ago|

[-]

Does anyone have an idea how much such a component costs?

reply

upvote

by danielovichdk10 hours ago|

[-]

Is this hardware for sale ? The site doesn't say.

reply

upvote

by petesergeant3 hours ago|

[-]

Future is these as small, swappable bits of SD-card sized hardware that you stick into your devices.

reply

upvote

by servercobra8 hours ago|

[-]

I don't know why, but my ultra wide monitor absolutely hates that site. The whole screen is flickering trying to deal with the annoying background. Thank the gods for reader mode.

reply

upvote

by MagicMoonlight9 hours ago|

[-]

Jesus, it just generated a story in 0.039s.

Whoever doesn’t buy/replicate this in the next year is dead. Imagine OpenAI trying to sell you a platform that takes 15 minutes, when someone else can do it in 0.001s.

reply

upvote

by nnevatie9 hours ago|

[-]

> generated a story

Was the story any good? The model used isn't comparable to something OpenAI would offer you today. What good is a fast gibberish generator?

reply

upvote

by vntok8 hours ago|

[-]

Question:

> Generate a fiction book about an old sorcerer who, like all elderly magicians, has to go back to studying magic for two years at a magic school in order to be granted the right to keep his wand and to keep doing magic at his home.

Answer, generated in 0,063s @ 15 742 tok/s:

> The Relearning of Eryndor Thorne

> In the quaint village of Brindlemark, where ancient traditions and mystical energies coexisted, Eryndor Thorne had lived a life of quiet reverence. For nigh on five decades, he had wielded the arcane arts with precision and mastery, earning him the respect of his peers and the awe of the villagers. His home, a humble abode atop a hill overlooking the rolling hills and verdant valleys, was a sanctuary of sorts, a place where the very fabric of magic was woven into the air. His library was a treasure trove of ancient tomes, texts, and scrolls, filled with the wisdom of the ages. Yet, despite his mastery over the arcane, Eryndor felt the stirrings of a nagging concern. The once-renewable license to practice magic, a privilege granted by the Council of Mages, was on the cusp of expiration.

> As all elderly mages knew, every two decades, the Council would demand that those who had spent a lifetime mastering the arcane arts return to the prestigious Wysteria Academy, the hallowed halls of magic learning, to relearn the foundational principles of magic. This was a ritual as old as the craft itself, a test of the elderly mage's understanding, to prove they still possessed the fundamental knowledge and skills required to practice magic with precision and safety.

> Eryndor, like many of his fellow mages, had grown complacent in his mastery. He had mastered the intricacies of elemental magic, bending the winds, waters, and flames to his will. His wand, a family heirloom passed down through generations, had become an extension of his arm, channeling the power of magic with ease. Yet, despite his proficiency, the Council's decree was clear: Eryndor must return to Wysteria Academy for two years of reeducation.

> The journey was not an easy one. Packing his worn leather satchel with the few books and scrolls he deemed essential, Eryndor bid farewell to his loved ones in Brindlemark, leaving behind the familiarity of his home to embark on a journey that would take him to a place both wondrous and treacherous. At Wysteria Academy, he joined a cohort of students from all corners of the realm, each as skilled and eager as he once was. The relearning process was a grueling one, as if facing a mountain he had once conquered. New techniques were presented, new theories to unravel, and new expectations to meet. Eryndor, once a master, now sat at the feet of his teachers, absorbing the new knowledge like a parched traveler at an oasis.

> Among his fellow students, there was the young, fire-kissed mage, Elara, who wielded magic with an intensity that bordered on reckless abandon. Her fiery nature and quick wit often put her at odds with the strict, ancient traditions, earning her a certain notoriety among the academy's elder mages. Then there was the enigmatic, shadow-drawn Kael, whose mastery of the arcane was matched only by his mystery. Kael's affinity for the dark arts raised more than a few eyebrows among the faculty, but Eryndor, having once walked the fine line between light and shadow, saw something of himself in the young mage.

> As the years passed, Eryndor grew to appreciate the challenges and opportunities that came with his return to the academy. He found himself grappling with the nuances of magic anew, rekindling memories of his early days as a novice. The relearning process was as much about rediscovering himself as it was about mastering the arcane. His studies were a journey of self-discovery, one that tested the mettle of his will and the depths of his understanding.

> Upon completion of his studies, Eryndor stood before the Council once more, his wand in hand, his heart afire with the thrill of rediscovery. The Council's examination was not merely a test of his knowledge but a test of his character. Eryndor, like many of his peers, had grown complacent, but the rigors of relearning had rekindled a spark within him, a flame that would guide him through the trials ahead.

> With his renewed license granted, Eryndor returned to Brindlemark, his home and his heart rejuvenated. His library, once a testament to his mastery, was now a canvas for the new knowledge he had acquired. His wand, now an extension of his rekindled passion for the arcane, channelled magic with a precision and flair that was unmatched. Eryndor Thorne, the elderly mage, had not only relearned magic but had rediscovered himself, a testament to the transformative power of education and the enduring spirit of the arcane.

reply

upvote

by Aerroon4 hours ago|

[-]

Imagine this thing for autocomplete.

I'm not sure how good llama 3.1 8b is for that, but it should work, right?

Autocomplete models don't have to be very big, but they gotta be fast.

reply

upvote

by moralestapia10 hours ago|

[-]

Wow, this is great.

To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!

Also, for many tasks I can think of, you don't really need the best of the best of the best, cheap and instant inference is a major selling point in itself.

reply

upvote

by raincole10 hours ago|

[-]

It's crazily fast. But 8B model is pretty much useless.

Anyway VCs will dump money onto them, and we'll see if the approach can scale to bigger models soon.

reply

upvote

by big-chungus48 hours ago|

[-]

write six seven as a number

> The number "six" is actually a noun, not a number. However, I assume you're asking to write the number 7 as a numeral, which is: 7

reply

upvote

by nickpsecurity5 hours ago|

[-]

My concept was to do this with two pieces:

1. Generic, mask layers and board to handle what's common across models. Especially memory and interface.

2. Specific layers for the model implementation.

Masks are the most expensive part of ASIC design. So, keeping the custom part small with the rest pre-proven in silicon, even shared across companies, would drop the costs significantly. This is already done in hardware industry in many ways but not model acceleration.

Then, do 8B, 30-40B, 70B, and 405B models in hardware. Make sure they're RLHF-tuned well since changes will be impossible or limited. Prompts will drive most useful functionality. Keep cranking out chips. There's maybe a chance to keep the weights changeable on-chip but it should still be useful if only inputs can change.

The other concept is to use analog, neural networks with the analog layers on older, cheaper nodes. We only have to customize that per model. The rest is pre-built digital with standard interfaces on a modern node. Given the chips would be distributed, one might get away with 28nm for the shared part and develop it eith shuttle runs.

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by notenlish11 hours ago|

[-]

Impressive stuff.

reply

upvote

by standeven6 hours ago|

[-]

Holy shit this is fast. It generated a legible, original, two-paragraph story on given topics in 0.025s.

reply

upvote

by heliumtera2 hours ago|

[-]

Yep, this is the most exciting demo for me yet. Holy cow this is unbelievably fast.

The most impressive demo since gpt 3, honestly.

Since we already have open source models that are plenty good, like the new kimi k2.5, all I need is the ability to run it at moderate speed.

Honestly I am not bullish on capabilities that models do not yet have, seems we have seen it all and the only advancement have been context size.

And honestly I would claim this is the market sentiment aswell, anthropic showed opus 4.6 first and the big release was actually sonnet, the model people would use routinely. Nobody gave a shit about Gemini 3.1 pro, 3.0 flash was very successful...

Given all the recent developments in the last 12 months, no new use cases have opened for me. Given this insane speed, even on a limited model/context size, we would approach IA very differently.

reply

upvote

by GaggiX9 hours ago|

[-]

For fun I'm imagining a future where you would be able to buy an ASIC with like an hard-wired 1B LLM model in it for cents and it could be used everywhere.

reply

upvote

by PrimaryExplorer9 hours ago|

[-]

this is absolutely mindblowing speed. imagine this with opus or 5.2

reply

upvote

by YetAnotherNick10 hours ago|

[-]

17k token/sec is $0.18/chip/hr for the size of H100 chip if they want to compete with the market rate[1]. But 17k token/sec could lead to some new usecases.

[1]: https://artificialanalysis.ai/models/llama-3-1-instruct-8b/p...

reply

upvote

by OrvalWintermute5 hours ago|

[-]

wow that is fast!

reply

upvote

by hagbard_c8 hours ago|

[-]

Holy Hallucinations, Batman...

User You: What is the airspeed of an unladen swallow/

Jimmy:

A reference to the classic line from Monty Python and the Holy Grail!

Thus far it is correct but here it comes:

In the movie, a group of medieval scholars, the Knights who say "Ni!" (also known as the Bridgekeepers), ask a king's servant, a blacksmith, if an unladen swallow makes good time. The blacksmith, played by John Cleese, responds with the now-famous line:

"And what's the airspeed velocity of an unladen swallow?"

To which the blacksmith responds:

"What do you mean? An unladen swallow?"

The blacksmith then explains that an unladen swallow is when a swallow (a type of bird) is flying in the air without carrying any burden, implying that it is moving at a speed that is not affected by carrying any weight or load.

Nope, that is not at all how it went. No blacksmiths, not related to the Knights who say 'Ni!' - who have nothing scholarly about them, they wanted a shrubbery and demanded another one when their wish was fulfilled. It was the Bridgekeeper - not related to the aforementioned knights - who asked 3 questions and would throw anyone who could not answer all three into the chasm. When he asked Arthur - played by Cleese - the question about the airspeed of the swallow he asked the Bridgekeeper whether he meant an African or European swallow. The Bridgekeeper did not have an answer and was thrown into the chasm, problem solved.

However, in reality, swallows are not typically "unladen" in flight. They are small birds that fly at relatively fast speeds, usually around 10-15 km/h (6-9 mph), but they are not usually carrying any burdens!

Needless LLM-blabber.

The "airspeed velocity of an unladen swallow" has become a meme and a cultural reference point, often used humorously or ironically to refer to situations where someone is trying to make an absurd or non-sensical argument or ask an absurd question.

Somewhat correct but not necessary in this context.

The correct answer to the question would have been Do you mean an African or European swallow? followed by a short reference to the movie.

Of course this demo is not about the accuracy of the model - 'an old Llama' as mentioned elsewhere in this thread - but it does show that speed isn't everything. For generating LLM-slop this hardware implementation probably offers an unbeatable price/performance ratio but it remains to be seen if it can be combined with larger and less hallucination-prone models.

reply

upvote

by cheema338 hours ago|

[-]

> Holy Hallucinations, Batman...

Congratulations! You figured out that this is a demo of a very small 8B model from 2022.

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by pelasaco9 hours ago|

[-]

Is it already available to buy, or is this a “pay now, get it later” kind of new ASIC miner? Sorry for being skeptical, but AI is the new "crypto coin", and the crypto bros are still around.

reply

upvote

by zamadatix4 hours ago|

[-]

There doesn't seem to be any form of buying the HC1 hardware at the moment. There is a free chatbot demo and then a form to request access to API. They seem to intend HC1 to be for demonstration and HC2 for "real" use, but they don't seem to be taking payment for either at the moment.

reply

upvote

by johnjames8710 hours ago|

[-]

[dead]

reply

upvote

by small_model10 hours ago|

[-]

Scale this then close the loop and have fabs spit out new chips with latest weights every week that get placed in a server using a robot, how long before AGI?

reply

upvote

by fragkakis10 hours ago|

[-]

The article doesn't say anything about the price (it will be expensive), but it doesn't look like something that the average developer would purchase.

An LLM's effective lifespan is a few months (ie the amount of time it is considered top-tier), it wouldn't make sense for a user to purchase something that would be superseded in a couple of months.

An LLM hosting service however, where it would operate 24/7, would be able to make up for the investment.

reply

upvote

by viftodi10 hours ago|

[-]

I tried the trick question I saw here before, about the make 1000 with 9 8s and additions only

I know it's not a resonating model, but I keep pushing it and eventually it gave me this as part of it's output

888 + 88 + 88 + 8 + 8 = 1060, too high... 8888 + 8 = 10000, too high... 888 + 8 + 8 +ประก 8 = 1000,ประก

I googled the strange symbol, it seems to mean Set in thai?

reply

upvote

by danpalmer10 hours ago|

[-]

I don't think it's very valuable to talk about the model here, the model is just an old Llama. It's the hardware that matters.

reply