Performance per dollar is getting faster and cheaper


(www.wafer.ai)

325 points

by latchkey20 hours ago |

by minraws19 hours ago|

Can you folks add performance per watt as a metric to these comparisons? I honestly want to understand where AMD fits in the stack in terms of actual performance per dollar. I have had talks with companies that want to build data centers outside of the US and find it hard to source anything Nvidia in sufficient capacity and scale.

AMD gets interesting if it is competitive on performance per watt and roughly reliable in terms of software support, which is what most folks outside of the US prioritize above all else, since outside of China and the US electricity tends to be at a relative premium.

Maybe if they make smaller data centers viable at the right price, AMD could be part of the stack outside of the US, wherever Nvidia is more limited in supply. Though I have genuinely no idea what sourcing an AMD GPU looks like.

I have never seen a company use AMD outside of wafer and a couple others mostly in US.

Genuinely intriguing or maybe not really (could be this stuff is common knowledge) and I am just stuck in my Nvidia bubble here.

by kingstnap17 hours ago|

A DGX B200 costs like ~$0.5 M and uses around 14 kW.

If you plan to run it for 8 years straight at 100% usage, that's around 1 GWh.

A gigawatt hour is a lot of energy, but it's not that much compared to the price of the actual machine. In Germany, for example, with its expensive energy, that's about €100k worth, which spread over 8 years is pretty minor compared to the upfront half mill.

The real issue with high power consumption is not really the cost of energy but the limited power supply you can get for a datacenter. A more efficient setup is highly desirable because it means you can fit more into the limited power hookup.
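
The arithmetic in this comment can be checked with a short sketch. It assumes the figures quoted above (14 kW, 8 years, 100% utilization) and a flat price of €0.10/kWh, which is only the rate implied by "1 GWh is about €100k" and not a quoted tariff. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def lifetime_energy_kwh(power_kw, years, utilization=1.0):
    """Energy drawn over the machine's life, in kWh."""
    return power_kw * HOURS_PER_YEAR * years * utilization


def energy_cost(energy_kwh, price_per_kwh):
    """Total energy bill at a flat price per kWh (an assumption)."""
    return energy_kwh * price_per_kwh


energy = lifetime_energy_kwh(power_kw=14, years=8)
cost = energy_cost(energy, price_per_kwh=0.10)  # assumed flat EUR/kWh
print(f"{energy / 1e6:.2f} GWh, roughly EUR {cost:,.0f} over the lifetime")
```

Under those assumptions the energy bill comes to roughly a fifth of the ~$0.5M purchase price, which is the point being made: the power hookup is the constraint, not the bill.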

by minraws3 hours ago|

It's not even about the costs, getting enough power for a large datacenter is impractically hard in most of the world at a single location.

If it's efficient, and not just the ongoing power costs but also the upfront setup costs are lower, that makes a lot of different scales of data center practical, especially for inference, which doesn't need massive superclusters.

You can't just fire up gas turbines everywhere like US data centers are doing. I am not even sure if that's legal in the US...

Note that you have to plan for peak usage and a lot of other stuff; large-scale data centers are insane infrastructure projects.

Nvidia is both supply and price constrained. Sure, if you are willing to pay over $0.5M you might get some, but if you try to balance price against cost by going slightly lower on the pole, you realize just how much more expensive Nvidia truly is. It feels like AMD has a lot of margin to undercut them if they want to.

by bayindirh4 hours ago|

> but the limited power supply you can get for a datacenter.

Since many people haven't seen 10 MW cabling for a data center or how a big GPU server is cabled, they naturally imagine connecting servers is akin to plugging an appliance into a wall.

When the electricity provider says "I neither have the capacity, nor the required cables in that area", things get real.

by willis9362 hours ago|

What they're really asking the authors is "can you not lie about performance cost and do proper accounting?". You can spin any story if you cherry pick your framing sufficiently. Stopping right at the silicon packaging boundary is as meaningless as it seems.

The article is highly qualified but the headline is not. If they are not making general statements then they shouldn't open with them.

by dannyw14 hours ago|

It’s more than power supply. Cooling and ventilation become a MUCH bigger deal at rack scale, and that costs electricity too.

by bayindirh4 hours ago|

With liquid cooling technologies (direct or rear door heat exchange), cooling efficiently is easier when compared to a decade ago, and it's pretty efficient when you compare the power consumption numbers (server total vs. cooling total).

See PUE (Power Usage Effectiveness) for its scientific form.

by thereisnospork12 hours ago|

Cooling demand is only a fraction of the load: cooling 1 MW of heat will only cost a few tens to low hundreds of kW, depending on the specifics. 10-20% overhead for cooling is probably a close enough estimate for napkin math.
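
As a napkin-math sketch of that estimate: the snippet below takes the 1 MW load and the 10-20% overhead range from this comment and converts them into cooling power and a PUE-style ratio (total facility power divided by IT power). It deliberately ignores every other overhead (power conversion, lighting and so on), so the ratio is a lower bound, and the function names are illustrative only.

```python
def cooling_power_kw(it_load_kw, overhead_fraction):
    """Cooling power as a fixed fraction of the IT load."""
    return it_load_kw * overhead_fraction


def pue(it_load_kw, overhead_kw):
    """Power Usage Effectiveness: total facility power / IT power."""
    return (it_load_kw + overhead_kw) / it_load_kw


it_load_kw = 1000  # the 1 MW example from the comment
for overhead in (0.10, 0.20):
    cooling = cooling_power_kw(it_load_kw, overhead)
    print(f"{overhead:.0%} overhead: {cooling:.0f} kW of cooling, "
          f"PUE of about {pue(it_load_kw, cooling):.2f}")
```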

by psychoslave12 hours ago|

And datacenters have an impact on everything around them. If at the end of the day the result is a few more yachts and jets, and a lot more miserable humans starving in ruined ecosystems, maybe that’s not the best go-to direction.

by butvacuum11 hours ago|

You say they have a large impact, but having lived somewhere with some of the largest data centers, they very much don't. At least not more so than any other structure that paves over greenery.

Love to debate actual discussion points. Pull up "datacenter dfw" on Google Maps for mine.

by ffsm811 hours ago|

The people having glass literally break from the vibrations would probably disagree with your opinion

https://youtu.be/_bP80DEAbuo?is=sg09k66iutKFIFSo

Yet here we are, discussing "data centers" as if they're standardized and of similar (noise) isolation.

There are no meaningful regulations in building them, and they can be incredibly polluting. So your experience with a potentially well isolated one is sadly not the norm going forward. And we don't even know how close you lived, if you're eg talking about "within 5km/3miles" then your experience would also have little value in this discussion in general.

by jml7c510 hours ago|

>The people having glass literally break from the vibrations would probably disagree with your opinion

Can you cite a source for this? It's not in the video, as far as I can tell.

I would be wary of Benn Jordan's videos. They are full of mistakes and misrepresentations, as Andy Masley has convincingly demonstrated: https://blog.andymasley.com/p/contra-benn-jordan-data-center...

I recall seeing Benn Jordan's responses on Bluesky and thinking they were quite poor. He was unwilling to admit to mistakes, and kept trying to grasp at newly searched papers that didn't actually support his arguments.

by redsocksfan458 hours ago|

[dead]

by hypfer9 hours ago|

Benn unfortunately is one of those people that actually feel stuff, which is a trait that easily gets exploited by bad actors.

Indeed, he shot himself in the foot there pretty bad, but I would argue that that was just the result of successful Agitation.

I would personally strongly prefer being in the same room with Benn compared with Andy, because one of them is authentic, while the other is calculating. Though, arguably, Benn has been catching up on that lately too.

But yeah, taking stuff with a grain of salt should be the default regardless of the person speaking.

by apublicfrog10 hours ago|

The fact that people have lived and worked near data centres for decades and didn't even know what the term meant, let alone been adversely impacted by them, probably indicates they're broadly a non-issue. All of a sudden, out of nowhere, AI and data centres got intermingled by the media and now people seem to have big issues with them.

by bayindirh4 hours ago|

Because the dynamics have shifted enormously inside the rack.

10 years ago, I was running 4 CPU servers with 48 cores and 128GB of RAM in 2U enclosures with a maximum power consumption of 500W or so. I was able to stick ~20 of them in a 42U rack, totaling 10kW.

A data center full of these can be cooled with CRACs and hot/cold aisles without much problem. This is still too much for a bog-standard server colocation operation, but for HPC, that was normal and manageable.

Now, a ~1U server houses 4 SOTA NVIDIA GPUs, 64 cores, and orders of magnitude more RAM. This server alone uses ~3 kW of power. This means you go anywhere from 30 kW to 50 kW per rack, and you have many racks.

Of course this means more power comes in, more heat comes out. This means more sophisticated infrastructure: bigger and beefier primary and secondary power systems, beefier cooling, more heat, more noise, in short "more of everything".

Of course when you cram this much energy and heat into a relatively small space, its effect on the environment will be much more pronounced.

Facebook's previous SOTA datacenter used water-infused, HEPA-filtered, free-flowing air across the datacenter. Now, it's server-level direct liquid cooling with extensive water treatment and oversight of coolant parameters.

Compare this to holding a hand warmer vs. a coal ember in your hand. The latter needs a much more elaborate setup to keep it from burning you badly.
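
To put the rack numbers from this comment side by side, here is a minimal sketch. It uses only the figures quoted above (20 servers at ~500 W, then ~3 kW GPU servers against a 30-50 kW rack budget); the helper names are invented, and real racks also lose budget to switches and power distribution, which this ignores.

```python
def rack_power_kw(servers, watts_per_server):
    """Total rack draw in kW for identical servers."""
    return servers * watts_per_server / 1000


def servers_within_budget(rack_budget_kw, watts_per_server):
    """How many identical servers fit under a rack power budget."""
    return int(rack_budget_kw * 1000 // watts_per_server)


# The older 2U CPU servers: 20 per rack at ~500 W each.
print(f"CPU rack: {rack_power_kw(20, 500):.0f} kW")

# ~3 kW GPU servers against the 30-50 kW per-rack range.
for budget_kw in (30, 50):
    count = servers_within_budget(budget_kw, 3000)
    print(f"{budget_kw} kW budget: {count} GPU servers")
```

The takeaway is that power, not physical rack units, caps how many of the ~1U GPU servers fit in a 42U rack.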

by butvacuum3 hours ago|

Why are you implying all datacenters are GPU farms? You can't retrofit that kind of power density into existing buildings.

You can stuff GPU servers into existing buildings- but even with significant upgrades you end up with a lot of empty space on the floor that can't be used.

by bayindirh3 hours ago|

Two main reasons.

1. Article is about AI, so I have given the example for an AI datacenter.

2. In pure CPU datacenters, the power dynamics do not change much. I can add more servers to a single rack, but the rack power is again in the 30kW to 50kW range, so you're planning and building for the same power capacity.

> You can stuff GPU servers into existing buildings-

Yes.

> but even with significant upgrades you end up with a lot of empty space on the floor that can't be used.

Yes & No. It's not impossible to convert an old datacenter to support ~35KW/rack capacity, but it's not cheap, and you'll have more worries than holes, piping, building and power. Namely, can your floor handle that much weight to begin with?

by Liftyee10 hours ago|

Though, the new data centers are not entirely the same. Increasing use of onsite gas turbines to generate power instead of using grid power changes their noise+air pollution profile.

by butvacuum3 hours ago|

Afaik, it's only the so-called "portable" generators that OpenAI used to contravene noise and pollution regulations.

by fragmede8 hours ago|

The problem these days is lack of nuance. It should seem entirely reasonable to be pro-datacenters-if-they're-done-right, but it feels like there are only two sides to any issue. Gas turbine whine noise isn't coming from the data center, it's being used to power the data center, but the camp is either pro data center or not, and fuck any nuance.

by ninjalanternshk6 hours ago|

The problem is people keep trying to regulate businesses by name instead of by the effects they have.

If we had regulations on noise, vibration, emissions, water use, electromagnetic radiation, whatever else, then it wouldn’t matter what people tried to build — if it fits within the guidelines great, otherwise back to the drawing board.

Putting “data center” in your ordinances is as lazy and ineffective as putting “abattoir.”

by altcognito5 hours ago|

> If we had regulations on noise, vibration, emissions, water use, electromagnetic radiation, whatever else, then it wouldn’t matter what people tried to build

We certainly do! It’s just often overridden and ignored for these companies and data centers

by quickthrowman3 hours ago|

> If we had regulations on noise, vibration, emissions, water use, electromagnetic radiation, whatever else, then it wouldn’t matter what people tried to build — if it fits within the guidelines great, otherwise back to the drawing board.

Sane jurisdictions do have regulations regarding these things. Not all jurisdictions are sane, some of them are run by people who sell out their residents.

Suburbs and cities around me all have noise regulations, my state has its own pollution regulations, and the local water utilities don’t hook up customers that stress the system. Unfortunately there are places like Texas, Tennessee, Louisiana, Mississippi that don’t give two shits about their citizens and let companies run temporary natural gas turbines permanently and all kinds of other nonsense.

by __egb__6 hours ago|

Maybe the lack of nuance is due to learning, through decades of experience, that the assumption “it won’t be done right” can be baked in.

by wongarsu6 hours ago|

So people have a decades-long expectation that local government will fail them?

This does sound plausible, but it's also pretty sad and not a sign of a healthy democracy

by jfengel2 hours ago|

I'm hard pressed to think of anyone who believes that America has a healthy democracy. Even those most recently elected continually claim that democracy is under threat.

by Forgeties796 hours ago|

Because the reality is while we all debate the nuances companies just do whatever they want, and it’s usually whatever offloads the most issues to the public because it saves them more money.

by lnsru10 hours ago|

Sounds exactly like the stories with 5G cell towers. Almost no problems with GSM and then suddenly 5G is big issue.

by ninjalanternshk6 hours ago|

> There are no meaningful regulations in building them

If a municipality doesn’t have emissions, noise, water use, etc regulations, that’s a serious failure in governance.

We neither need nor want the term “data center” in regulations any more than we need the word “abattoir.”

The names of the things we build change all the time. Their impact on their communities doesn’t.

We need to regulate impact, not the name or type of business.

If we did, nobody would know or care about data centers and they wouldn’t be affecting their communities, because they’d be operating under established impact regulations.

by rpdillon6 hours ago|

How far do you live from a data center?

by well_ackshually2 hours ago|

A constant low 60 dB, 20 Hz hum in the background, 24/7, is as close to a torture technique invented by the CIA as it can get.

by heisenbit9 hours ago|

Plus the power needed for cooling adding maybe 50%.

by jwpapi9 hours ago|

Interesting so it’s supply chain and then you need to calculate how long it can be utilized and for how much you can sell it.

Would love more calculations on that

by Twirrim18 hours ago|

> I have never seen a company use AMD outside of wafer and a couple others mostly in US.

There's a few using them, and even more starting to experiment with them. AMD has long been a source of disappointment around this side of things, so I'm hesitant to feel optimistic we'll finally get some competition. The market really needs viable competition to Nvidia, especially performance/watt.

by craftkiller18 hours ago|

> I have never seen a company use AMD

Meta is using AMD: https://www.amd.com/en/newsroom/press-releases/2026-2-24-amd...

And OpenAI: https://www.amd.com/en/newsroom/press-releases/2025-10-6-amd...

by minraws3 hours ago|

OpenAI maybe, but a few friends in Meta said they don't so dunno man. Seems sus atm.

But it's Meta; they can get a GW of AMD up in a year.

by Schiendelman17 hours ago|

It's not clear when this will be - AMD has slipped these dates likely to 2027.

by embedding-shape9 hours ago|

> I have never seen a company use AMD outside of wafer and a couple others mostly in US.

Worth remembering that AMD has basically "owned" (not literally) the hardware side of things in video game consoles for a good many years now, with no end in sight.

by minraws3 hours ago|

I was talking in the data center gpu context, EPYCs are pretty common in data centers these days.

I have a huge EPYC-based data center like 200-300+km from my house, on the outskirts of the city, a few dozen miles from an IT industry tech park (a place with lots of IT company offices).

by ekianjo8 hours ago|

Because they have x86 CPU licenses.

by wongarsu6 hours ago|

Consoles used to all be custom architectures. If Intel was the only one doing x86 and AMD had offered the same price, performance and features as they do now, but in another architecture, my bet is that in that universe AMD would still have gotten the contract. Using x86 is a big deal to simplify things, but so is AMD's APU with unified memory between CPU and GPU (similar to what Apple now does with their silicon)

by embedding-shape8 hours ago|

Every single video game console of the last generation (and probably further back) is using AMD Radeon for graphics too, FWIW. I think the Switch might be the only recent outlier, using Nvidia graphics.

by duped3 hours ago|

AMD invented x86_64

by 7thpower12 hours ago|

Typically any company that can’t get Nvidia to fill their orders will have at least some AMD.

by embedding-shape8 hours ago|

What type of company are you talking about here? Granted, nowadays I mostly interact with ML-adjacent companies, but almost none would go "Hmm, hard to get Nvidia hardware today, let's dump all the expertise and knowledge of CUDA et al. we have and start using AMD hardware until we can get Nvidia"; everyone would just wait or rent in the meantime.

by wongarsu6 hours ago|

Inference workloads are usually a lot less picky about the exact hardware than model training. At least in the cases I know of the models are trained on Nvidia hardware, then exported and run on a mix of Nvidia and AMD

by minraws3 hours ago|

At scale, for inference, it's almost unheard of for a data center company to go for AMD because they couldn't get or afford Nvidia atm.

They instead start the build out and plug in stuff they can, then take a loan or ask Nvidia to help fund it. (I am not joking)

I believe the case is if you can prove to Nvidia you can install and provide more Nvidia capacity they help out because more Capacity going online today is in the best interest of Nvidia.

Spot prices of Nvidia GPUs going up is not good news for Nvidia, btw. The people renting Nvidia have the least amount of friction in moving off Nvidia, especially since with AI tools you could build and get up to speed with the AMD stack much sooner...

So if Nvidia is truly not an option and your entire company is not a bet on Nvidia, then you will move off, but only as a renter, not as a buyer, unless you truly can't fund Nvidia, I suppose.

But again I repeat if you build a datacenter and provide good enough base Nvidia will help fund you to a mostly complete data center.

People might not like it but that's the reason Nvidia is so unreasonably dominant even now when otherwise given the scale of investments it might have been cheaper to look for alternatives.

This is why Nvidia doesn't like the China stack.

by latchkey16 hours ago|

> I have never seen a company use AMD outside of wafer and a couple others mostly in US.

Just because you haven't seen it doesn't mean it doesn't exist.

We've serviced over 700 customers on our MI300x.

by jingpostmedia5 hours ago|

[flagged]

by jingpostmedia8 hours ago|

[flagged]

by technoabsurdist18 hours ago|

AMD MI355X uses 1,400W per GPU and NVIDIA B200 uses 1,200W. So AMD uses about 16% more power.

by vlovich12317 hours ago|

That's not how you measure performance per watt, but generally it’s 20-60% worse at tok/s/watt, not 16%. It does have ~50% more memory (~100 GB more), which complicates the comparison.
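
A small sketch of why the power figures alone don't settle this: performance per watt is throughput divided by power, so the ratio between two chips depends on both. The 1,400 W and 1,200 W figures are the ones quoted in the parent comment; the throughput ratios are hypothetical inputs, not measurements, and the function names are illustrative.

```python
def tokens_per_joule(tokens_per_second, watts):
    """tok/s divided by W gives tokens per joule (1 W = 1 J/s)."""
    return tokens_per_second / watts


def relative_efficiency(throughput_ratio, power_ratio):
    """Perf-per-watt of chip A relative to chip B, from A/B ratios."""
    return throughput_ratio / power_ratio


power_ratio = 1400 / 1200  # power figures quoted in the parent comment

# Hypothetical case: identical throughput. The power gap alone then
# leaves chip A at about 86% of chip B's perf-per-watt.
print(f"equal throughput: {relative_efficiency(1.0, power_ratio):.2f}")

# Throughput ratio at which chip A ends up 20% worse per watt.
print(f"throughput ratio for 20% worse: {0.8 * power_ratio:.2f}")
```

So a 20-60% gap in tok/s/watt implies the chip is also behind on measured throughput for that workload, which is a separate claim from the TDP difference.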

by hassaanr14 hours ago|

While cool, quantization to FP4 is practically never lossless in actual use. A lot of providers are advertising high TPS on Kimi and GLM, but the models are functionally lobotomized and no longer close to frontier quality. Would love to see this not be true.
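
For readers who want to see where the loss comes from, here is a minimal, pure-Python sketch of an MXFP4-style round trip. It assumes the layout as I understand it from the OCP Microscaling format: 4-bit E2M1 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) in blocks of 32 that share one power-of-two scale. The Gaussian "weights" are a stand-in, not a real model, the rounding is simplified, and this is not any provider's actual pipeline.

```python
import math
import random

# Magnitudes representable by an E2M1 element (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements sharing one power-of-two scale (assumed)


def quantize_block(block):
    """Round one block to E2M1 values under a shared power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Shared exponent: floor(log2(amax)) minus E2M1's largest exponent (2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp to the largest magnitude
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def roundtrip_error(values):
    """Relative RMS error after quantizing and dequantizing."""
    quantized = []
    for i in range(0, len(values), BLOCK_SIZE):
        quantized.extend(quantize_block(values[i:i + BLOCK_SIZE]))
    num = sum((a - b) ** 2 for a, b in zip(values, quantized))
    den = sum(a ** 2 for a in values)
    return math.sqrt(num / den)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # stand-in data
print(f"relative RMS error: {roundtrip_error(weights):.3f}")
```

The error is nonzero for anything that doesn't already sit on the 4-bit grid. How much that matters for output quality depends on the model and the task, which a weight-level number like this cannot show.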

by zozbot23411 hours ago|

Kimi uses INT4 as its native format, there's no such thing as "better than 4-bit precision" for that model. This is in contrast with GLM for which 16-bit precision is native and 8-bit is in common use.

by hassaanr8 hours ago|

You’re right, but this poses a separate issue as the providers then do FP4 PTQ, which is quite lossy. Reduces the model size and optimizes for Blackwells at the (imo severe) cost of performance.

by unrvl2212 hours ago|

MI355X can perform FP6 operations with the same speed as their FP4 (unique to AMD) - people should be making MXFP6 quants which would be pretty much lossless, and much closer to FP4 performance than FP8

by Hugsun4 hours ago|

That can only be true if the workload is compute bound, not memory bandwidth bound.

by minraws3 hours ago|

Doesn't Nvidia with their NVFP4 claim that it's lossless?

I haven't tested enough models Nvidia has converted to NVFP4 besides GLM 5.2 but it seemed fine to me.

My own luck has been hit or miss with it.

by google23412313 hours ago|

First thing I noticed as well

by tw198413 hours ago|

from memory, it is like 96-98% of the accuracy.

by lgessler13 hours ago|

Accuracy isn't a meaningful metric here without reference to a specific task.

by flawn9 hours ago|

Additionally, I'd imagine quantization to have more side-effects than just slightly lower performance (on whatever task). You are basically removing information, and that information could be by chance what the model needs to fulfill it exactly the way you'd want to do - although it's still fully capable. I am not sure if this is really different from "lower performance" but open to hear your opinions.

by EduardoBautista12 hours ago|

And that 2%-4% makes all the difference.

by fpaf10 hours ago|

Yes, it's like saying "we took off a big chunk of his brain but look! He can still breathe autonomously, swallow food and walk almost straight, which is like 95% of what he did before!"

by nxtfari15 hours ago|

I think we should make it illegal to not specify the quantization in the headline for these types of posts.

by ahmadyan15 hours ago|

It's MXFP4

by IshKebab7 hours ago|

And to use the heading "Why this matters".

by ozgrakkurt8 hours ago|

A nice filter is checking for the `.ai` in the end. It is very likely slop if you see that. Slop meaning low-effort/clickbait/shallow/useless/scam etc.

by 484849495 hours ago|

triggered the grifters

by mchusma2 hours ago|

I was hoping they would be discussing some path to improving things faster and cheaper. But in this post it looks like they offer quantized version for the same price as full version, and a fast version at much higher cost.

by p1esk17 hours ago|

There’s noticeable accuracy degradation when they switched from fp8 to mxfp4

by greyb14 hours ago|

Wafer discontinued their own "Wafer Pass" flagship coding plan within weeks of launch and had to issue prorated refunds. Now they're bragging about squeezing costs down even further via quantization, even though their implementation is clearly lacking.

[1] https://www.ycombinator.com/launches/Q9i-wafer-pass-flat-rat...

by throwdbaaway17 hours ago|

And somehow they claimed that it is "lossless".

by sometimelurker1 hours ago|

I like the metric of tok/joule a lot. it really brings to mind a lot of really nice ideas about energy and work and ideas and thought and efficiency

by gcanyon2 hours ago|

Isn't this pretty much a given? Performance per dollar has to be a ratcheting function because how would something more expensive replace something less expensive?

by ilaksh3 hours ago|

The compute-in-memory and neuromorphic paradigms are likely to push this much, much farther over the next decade as more radical improvements make it out of the lab. Sooner or later it will involve new materials and new nano devices and providing multiple orders of magnitude better efficiency. And just scaling up existing things like MRAM.

by tim3336 hours ago|

Not a new phenomenon - performance per dollar has been fairly steadily exponentialling since 1900 or so

1900 - 2010 https://www.thekurzweillibrary.com/exponential-growth-of-com...

1939 - 2023 https://medium.com/@timventura/kurzweils-law-for-the-ai-age-...

by Schiendelman17 hours ago|

I'm not surprised to see competition with Blackwell. Rubin is 5x faster than Blackwell at inference - Blackwell is the last generation Nvidia didn't optimize specifically for inference.

If I'm missing something, please let me know!

by boroboro414 hours ago|

It's very unclear what's special in Rubin that makes it optimized for inference. I can see the disaggregated bit (having separate prefill and decoding nodes), but what else?

by villgax13 hours ago|

Lot more SMs & Tensor Cores for NVFP4 going by the looks of it.

by nullc17 hours ago|

how do you get 5x faster at inference when inference is memory bandwidth limited? getting 5x the memory bandwidth of a h100 seems physically difficult.

by Schiendelman16 hours ago|

Rubin has 22TB/s of memory bandwidth vs Blackwell's 8TB/s. NVLink 6 doubles interconnect speed. Plus they're moving to 3nm from ~4nm.

(Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)

by boredatoms16 hours ago|

Moving to lower bits is not a slam dunk, the model itself might degrade too much

by Schiendelman15 hours ago|

Of course, but for most workflows it's fine.

by zackangelo15 hours ago|

Blackwell supports nvfp4 natively.

by Schiendelman15 hours ago|

You're right - Rubin is better at NVFP4 training, not inference, thank you for catching me!

by boroboro414 hours ago|

What does it mean it's better at nvfp4 training? What's different between training and inference to make this true?

by Schiendelman14 hours ago|

We're getting to the limit of my understanding, but I believe most Blackwell users still usually run FP8 passes through the transformer engine - they'll just store weights at NVFP4. Nvidia has model-specific stabilization recipes for NVFP4 end to end, but they're taking fixes all the time.

Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.

But yeah, grain of salt - we haven't seen this in practice.

by fc417fc80214 hours ago|

I'm also puzzled by that statement. The issue with training is (as I understand it) one of precision and the associated numerical stability. You need enough bits in order for backprop to function correctly.

Of course there are techniques such as quantization aware training but I don't understand why a datatype would work for inference but not for that.

You can also abandon backprop entirely but that comes with a whole host of tradeoffs and again why would it work for inference but not for whatever alternative training regime you selected?

by unrvl2211 hours ago|

Inference is only memory bandwidth limited when targeting high single-stream tps. The weights only need to be moved across once per forward pass, so when you batch, say, 100 streams per forward pass (which is what most inference services do and care about), it's compute bottlenecked.
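
The batching argument can be sketched with a toy roofline-style model. It assumes a dense model where each decode step reads all weights once and spends about 2 FLOPs per parameter per token, and it ignores KV-cache traffic, attention cost and MoE sparsity. The 8 TB/s bandwidth is the Blackwell figure quoted upthread; the 4e15 FLOP/s peak, the 1e11 parameter count and the 0.5 bytes per parameter (4-bit weights) are placeholder assumptions, not specs.

```python
def decode_step(batch, params, bytes_per_param, bandwidth, peak_flops):
    """Return (step time in seconds, limiting resource) for one decode step."""
    memory_time = params * bytes_per_param / bandwidth  # weights read once
    compute_time = batch * 2 * params / peak_flops      # ~2 FLOPs/param/token
    if memory_time >= compute_time:
        return memory_time, "memory"
    return compute_time, "compute"


def crossover_batch(bytes_per_param, bandwidth, peak_flops):
    """Batch size at which compute time catches up with memory time."""
    return bytes_per_param * peak_flops / (2 * bandwidth)


BANDWIDTH = 8e12    # bytes/s, figure quoted upthread
PEAK_FLOPS = 4e15   # FLOP/s, placeholder assumption
PARAMS = 1e11       # parameters, placeholder assumption

for batch in (1, 100, 1000):
    seconds, bound = decode_step(batch, PARAMS, 0.5, BANDWIDTH, PEAK_FLOPS)
    print(f"batch {batch}: {seconds * 1e3:.2f} ms per step, {bound}-bound")

print(f"crossover batch: {crossover_batch(0.5, BANDWIDTH, PEAK_FLOPS):.0f}")
```

In this simplified model the parameter count cancels out of the crossover, so where it lands depends only on bytes per parameter and the chip's ratio of compute to bandwidth.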

by AussieWog9318 hours ago|

The 2600 tok/s is an "aggregate", not the actual throughput.

by technoabsurdist18 hours ago|

yes it is 213 tok/s single stream (so per user)

by unrvl2212 hours ago|

that 213 wasn't achieved when saturated though. was probably more like 30 tps per stream when doing 2.6k tps.

by 383629364818 hours ago|

So per subagent*.

by alienbaby16 hours ago|

*per stream, I guess is more accurate than either?

by conorcleary4 hours ago|

*especially as many currencies weaken

by johanvts9 hours ago|

That sounds literally impossible.

by dtgriscom4 hours ago|

Agreed. The writer is pretty loose with their comparisons:

* What does it mean for "performance per dollar" to get faster? Higher, maybe; rise faster than it has in the past, maybe, but just "faster"? Nope.

* The article cites some equipment as being "2x cheaper". I think they mean "half the cost", but if so they should say it.

by oDot19 hours ago|

Do these providers have 80+% gross margins or is something eating into them? Maybe utilization?

by technoabsurdist18 hours ago|

hi i work at wafer. no the margins are lower averaging at about ~40%. utilization is one of the highest order bits in determining margins here, yes.

by keynha15 hours ago|

[dead]

by adammarples7 hours ago|

Slight criticism of the headline there, you can't get cheaper per dollar.

by hahahaa9 hours ago|

What is a knee, in performance talk?

by kgwgk9 hours ago|

A place where the slope/derivative/incremental-performance-per-price changes.

by nnevatie9 hours ago|

I used to be high-performance like you, then I took an arrow to the knee?

by alienbaby16 hours ago|

I'm interested if anyone knows how much legwork the assumed 60% cache hit, plus running a quantised model is doing? Esp. compared to what the headline half implies is a full fat GLM5.2

by ilaksh3 hours ago|

Can you actually rent an MI355X per hour anywhere right now?

by killingtime7415 hours ago|

No word on what this actually means as a consumer. What's the price. Is it lower than NVIDIA serving?

by mixtureoftakes13 hours ago|

They seem to be serving it at 3x the price while also struggling to maintain uptime on OpenRouter, while the Vercel router advertises even higher speeds but has no clear uptime stats.

I guess you really do have to try it at least for some time to actually know

by BurningFrog3 hours ago|

So... the headline is about performance per dollar per dollar?

by beffjezos14 hours ago|

This is very interesting and yet not at the same time. This looks to be optimized for single-stream LLM traffic which is not viable to serve in a production setting. It's only interesting to hobbyists that want to run the model locally.

It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this but at the same token (pun-intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.

by wmf14 hours ago|

You got it backwards; it's ~200 on single stream so the 2,600 is achieved with ~13 streams.
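
A quick sketch of the relationship being argued over: aggregate throughput is roughly concurrent streams times per-stream speed, so the same 2,600 tok/s implies very different concurrency depending on which per-stream figure held under load. All three per-stream numbers below are ones quoted in this thread (213 and ~200 single-stream, ~30 as another commenter's guess at saturation); none are measured here, and the helper name is made up.

```python
def implied_streams(aggregate_tps, per_stream_tps):
    """Concurrent streams needed to reach an aggregate throughput."""
    return aggregate_tps / per_stream_tps


AGGREGATE_TPS = 2600  # aggregate figure quoted in the thread

for per_stream_tps in (213, 200, 30):
    streams = implied_streams(AGGREGATE_TPS, per_stream_tps)
    print(f"{per_stream_tps} tok/s per stream: about {streams:.0f} streams")
```

This is why reporting tok/s per user alongside node throughput removes the ambiguity.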

by beffjezos14 hours ago|

Yeah that makes sense. I'm more familiar with seeing tok/s/user + TTFT rather than the total node throughput.

by technoabsurdist14 hours ago|

hi, yes, it’s not optimized for single stream; it’s optimized for total node throughput

by beffjezos14 hours ago|

Oh, that's much better then. A good metric to share is the tokens per second per user for the node rather than the total throughput of the node. It disambiguates what's being optimized for much better than your blog post currently does.

by technoabsurdist12 hours ago|

sounds good, feedback taken. thanks, beffjezos

by gowthamsaiyadav8 hours ago|

The world is not limited to Nvidia; AMD can be used too.

by calin2k11 hours ago|

then why is token per dollar getting more expensive?

by ilaksh3 hours ago|

There are a limited number of these available in comparison to demand. I think people figured out that LLMs and VLMs can do real work that can replace a lot of humans. And for plenty of jobs, it's good enough to reduce already outsourced staff by 75-90% at a fraction of the cost.

by FeepingCreature8 hours ago|

Because lots of people are willing to pay more dollar for smarter token.

by AtlasBarfed11 hours ago|

Because they are dumping/subsidizing token processing to try to get companies to fire as many people as possible, so those companies will be dependent on the providers when the providers have to jack up the rates.

by 7 hours ago|

deleted

by yieldcrv18 hours ago|

Agentic coding drivers for different architectures is a massive unlock for the world

So much compute is underutilized, waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired by the right prompts.

by technoabsurdist18 hours ago|

this is exactly our thesis at wafer :) thank you for the support

by yieldcrv14 hours ago|

well done

by yogthos18 hours ago|

Personally, I can't wait till something like this starts getting to consumer level. https://www.anuragk.com/blog/posts/Taalas.html

by yieldcrv17 hours ago|

That’s pretty fascinating. Apple has some innocuous LLMs and transformers baked into its devices that leverage their neural chipset.

So I could see something like this, where the neural chipset has an LLM baked into it that can't easily be updated until you get a new device.

by yogthos4 hours ago|

Exactly, it'd be the same as regular chip design evolving. You get a specific model version baked into the chip; if it does what you need, then it's fine. If you need more capability in the future, you just buy a new chip.

I also think the dynamic would be really different if model inference can run at ridiculous speeds. You could put a genetic algorithm loop around it, so it can generate a population of proposals at each step, then have those tested and whittled down iteratively. If inference happens at thousands of tokens per second, then from the user's perspective it would still be really fast, and even a small model could solve complex problems.
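
A toy sketch of what such a loop might look like. Everything here is hypothetical: `propose` is a stand-in for a fast model call (it just perturbs an integer), and `score` is a stand-in for whatever test harness would evaluate real proposals.

```python
import random


def propose(parent, rng, n=8):
    """Hypothetical stand-in for model inference: return n variations of parent."""
    return [parent + rng.randint(-5, 5) for _ in range(n)]


def score(candidate, target=42):
    """Hypothetical test step: higher is better, 0 means a perfect solution."""
    return -abs(candidate - target)


def evolve(seed_candidates, generations=50, keep=4, rng=None):
    """Generate proposals, test them, keep the best few, and repeat."""
    rng = rng or random.Random(0)
    population = list(seed_candidates)
    for _ in range(generations):
        # Each survivor spawns a batch of new proposals.
        proposals = [c for parent in population for c in propose(parent, rng)]
        # Test survivors and proposals together, then whittle down to the best.
        ranked = sorted(set(population + proposals), key=score, reverse=True)
        population = ranked[:keep]
        if score(population[0]) == 0:
            break
    return population[0]


best = evolve([0])
print(best, score(best))
```

With a real model, each `propose` call is where the thousands of tokens per second would matter: the loop is only practical if a whole population can be generated and tested in the time a user is willing to wait.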

by innis22613 hours ago|

[dead]

by zuzululu12 hours ago|

yeah but we are still far far away from being able to run the frontier model equivalents locally without significant quantization

even having something like opus 4.8 locally would completely change the landscape

by villgax13 hours ago|

They fail to mention non-speculative numbers and whether the baseline was nvfp4 as well. So much for erosion against an older gen.

by bitwize10 hours ago|

(in a high-pitched, pathetic regency-era British orphan voice) Please sir, may I have some compute as well?

by paulreaney1 hours ago|

[dead]

by servola1 hours ago|

[flagged]

by pullrun5 hours ago|

[flagged]

by jessinra985 hours ago|

[flagged]

by shevy-java10 hours ago|

But RAM prices skyrocketed!

The AI companies owe us money, as does e.g. NVIDIA for becoming a cartel.
