AMD Strix Halo RDMA Cluster Setup Guide

(github.com)

212 points

by jakogut17 hours ago

by pixelpoet15 hours ago

I have two 128GB Strix Halos and have been extremely excited about Antirez's (Redis author) work on DS4, especially with 4-bit quant using two machines: https://github.com/antirez/ds4

Right now the speed isn't good for GLM 5.2, but DeepSeek V4 Flash speed is okay for me (actually reading the output) and quite usable. See kyuz0's great recent video here: https://www.youtube.com/watch?v=PkKXm_mKCCM

With a bit more speed and model improvements, local AI becomes a reasonable practical thing! The biggest problem is all the tech companies making consumer hardware completely unaffordable, and I don't think this is accidental. Look at Micron's profits and share price lately...

I got my Strix machines for ~2k eur each, best computers this 90s kid has ever owned, but those days are gone :(

by sspiff10 hours ago

I had a Strix Halo laptop with 128GB which unfortunately died last week. I paid 2800 euro for it. If I buy the same machine today, the sticker price is 7899.

The device was not perfect by any means, but the ability to run fairly large models is some kind of magic.

by barbacoa9 hours ago

>sticker price is 7899.

It's not even worth it at that point.

You can get a used enterprise-grade SXM baseboard with 4-8 V100/A100 GPUs off eBay at a similar price. That will even get you actual HBM and NVLink, along with 10x the AI performance, assuming you don't care about your electricity bill of course.

by sspiff9 minutes ago

Yeah it isn't worth it, but comparing a server with a laptop is also not a relevant comparison.

I didn't get a Strix Halo laptop because it was the best bang per buck, I got it because it was an awesome machine that could do a little bit of everything, fit in a backpack and only needed 140W.

But no one should buy one at 7899, obviously. It was a tough sell for me at the old 2800 pricing.

by somewhatrandom93 hours ago

You can get a new M5 Max MacBook Pro with 128 GB of unified RAM (targeted by Antirez for DwarfStar4). Even after the Apple price increases, it's less than 7899 by at least $1000. And you probably won't pull more than 100 watts.

by sspiff11 minutes ago

You are comparing US pricing with EU pricing. EU pricing includes 21% VAT and currency conversion "rounding up".

The cheapest 128GB Macbook Pro here costs €7.949,00.

No doubt a better value than the HP, and will depreciate a lot less quickly, but just as expensive. Unfortunately, not being able to run Linux is a breaking point for me.

by dehugger1 hours ago

I have one of these. Got it a few weeks before the price increases. On the 14" version charging is limited to 96 watts, but the chip can pull north of that with adequate cooling, so the battery will literally drain while plugged in.

It isn't a problem for me, more amusing than anything else (I run in Low Power mode 90% of the time), but worth knowing for anyone who's thinking about pushing the hardware to its limit 24/7.

by Keyframe7 hours ago

Ignoring the fact that one would need a somewhat different setup (chassis, PSU) to run it, I casually looked and there's nothing below 25-50k euros for such a board decked out, depending on the config. TBH even that doesn't sound bad, but I wouldn't even know where to start with running it.

by barbacoa5 hours ago

This is from the USA site:

https://www.ebay.com/itm/157742745616

It's always a gamble buying used electronics, but per the description it's a fully decked-out server with 256GB of VRAM.

Though at 2U it is going to sound like a 747 taking off from your office.

by gessha4 hours ago

So that’s where all the used V100s are going. I’ve been thinking about making a water cooled version of those because I don’t have a rack to put these servers on.

by skinfaxi6 hours ago

Can you link it? 8k is the price of two A100s, so that would be a steal.

by barbacoa5 hours ago

4x A100s with the baseboard for $9k (USD).

https://www.ebay.com/itm/336632412718

Buy one and let me use it when you're sleeping.

by Keyframe7 hours ago

Everything Strix Halo went 2-3x bananas, same ballpark figures as Apple hardware now, and lead times on all of those are in months. Ridiculous where we ended up.

by rjzzleep11 hours ago

Last year you could buy an AI Max 395+ with 128G for 2.5k, now it's almost $4k.

Or maybe you're right, I originally remembered 2k as well. I wanted to wait for the AI Max 395+ upgrade of my laptop, and now it makes no sense to upgrade.

by stymaar11 hours ago

> Last year you could buy a AI Max 395+ with 128G for 2.5k, now it's almost $4

Only if you pay the Framework premium.

https://www.bosgamepc.com/products/bosgame-m5-ai-mini-deskto...

I don't have access to the USD price, but it's 2500€ (tax included), up from 1600€ in November when I ordered mine.

by pixelpoet11 hours ago

I think people buying laptops for AI use are, sorry, just plain crazy. You overpay for the screen and keyboard and battery and whatever, plus you get much worse thermal performance because of basic physics (area vs volume). My Framework Desktop has a Noctua cooler which works really well.

[Tangent: all my life I've been downvoted into a smoking hole in the ground, particularly on reddit r/hardware, for questioning the wisdom of laptops for high performance computing, including gaming. Everyone insists they need the mobility, and then just leave it plugged in the whole time, absolutely refusing to admit it's about aesthetic preference.]

by Gareth32110 hours ago

I’m mostly with you but there are some people who like to use one machine for both laptop and AI work, and it’s much cheaper than buying two separate devices.

by kamranjon10 hours ago

I generally agree for everything except MacBook Pros, which outperform most available desktop setups for AI tasks, but they are also now out of reach for most people after the price hikes (6.7k now for 128GB; I got mine for 4.7k just about a year ago).

Honestly I think this is just a bad time to be buying hardware - everything is marked up an insane amount that doesn't really make sense.

by rzzzt10 hours ago

For me the smaller footprint, lower power consumption and portability (admittedly between desks only) are the three advantages of using a laptop over a desktop for these purposes.

by pixelpoet10 hours ago

The Strix Halo mini PCs use the exact same chip, and have a much smaller footprint than any laptop. Have you seen the size of these machines? I can and have easily popped my daily driver computer into my very small backpack to attend a demoparty for example.

With the laptop you probably won't get silent operation at the peak 100-140w, i.e. you've now massively overpaid for lower performance.

by throwa3562629 hours ago

Can you get these from vendors like Asus and Lenovo these days?

The ones I've seen on aliexpress are from unknown Chinese vendors.

by pixelpoet9 hours ago

I have a Framework Desktop as primary PC (great cooling, beautiful case with handle) and the Bosgame M5 dedicated for AI use.

I was also a bit wary about Bosgame but TBH they've been great and the machine is rock solid, if a little noisier than and not as pretty as the FD. You can just buy from them directly and be fine, best computer deal out there by a mile.

by justsomehnguy6 hours ago

Maybe there is a reason?

It's like you are advocating for public transport instead of a personal car, but when asked how to get to a place that is not serviced by the public network, your solution is to rent a bus.

by Tepix10 hours ago

The cheapest ones with 128GB were 1580€/$1840 as late as mid December.

by Gareth32110 hours ago

I was hoping to buy a competent local model machine later this year but given the prices I’m shelving that for now. Especially because the frontier models are very cheap relative to the cost of building my own setup. Especially because AI specialised hardware and processors are improving very fast, meaning hardware we buy now will become obsolete for this use case much faster than for traditional computer use cases.

In 1-3 years the hardware crunch will be over, local distilled models will provide Opus 4.8 like intelligence, and the hardware will exist to provide usable performance.

by rnewme14 hours ago

What's the advantage of ds4 over llama.cpp, esp if down the line they upstream his forked kernels?

by pixelpoet11 hours ago

IIRC llama.cpp doesn't implement DSv4's compressed attention mechanism, and while it does use (credited) parts of llama.cpp, it's focused on this great model for now. Much of this is covered better in the repo's readme.

by rnewme8 hours ago

The repo README and antirez's Reddit comments also expressed a willingness to upstream.

by mkesper8 hours ago

Currently, llama.cpp clusters don't support tensor parallelism; have a look at Donato Capitella's detailed report: https://m.youtube.com/watch?v=PkKXm_mKCCM He also provides ROCm toolboxes for Strix Halo: https://strix-halo-toolboxes.com/#about

by francisduvivier11 hours ago

I think mainly that he can move much faster with specific improvements targeting DeepSeek on systems with unified memory (Mac or Strix). It's a lot easier to optimize if you don't need to worry about all the other architectures. So optimize he did, and it's just a lot faster than llama.cpp for DeepSeek V4 Pro and Flash. Interesting features are also more doable, like SSD streaming, which makes it possible to load MoE weights for a model larger than your VRAM. I don't see that landing in llama.cpp anytime soon.

by gruez14 hours ago

>The biggest problem is all the tech companies making consumer hardware completely unaffordable, and I don't think this is accidental. Look at Micron's profits and share price lately...

You realize "tech companies" isn't a monolith? Micron charging inflated prices doesn't magically benefit OpenAI. The "high prices keep out competitors" theory doesn't make much sense either. It's like saying Denny's benefits from higher egg prices because it makes cooking eggs at home more expensive.

by omgwtfbyobbq57 minutes ago

Denny's can benefit from higher egg prices if they can lock in long-term contracts with suppliers for lower egg prices when smaller companies selling directly to consumers can't.

I think that realistically, companies compete against each other as individuals and compete against smaller companies and individuals acting more like cartels/monopolies, and that's what OP is referring to in terms of hardware purchasing/contracts/pricing. This also extends outside of tech to investing, so it's likely not just tech responsible for this.

by sdf4j13 hours ago

You got it wrong. Use appliances instead of eggs. If getting an oven gets more expensive, I'd rather keep going to Denny's.

It’s classic capex vs opex. I’d keep paying my OpenAI subscription instead of dropping $3k to run a subpar model. If the thing cost $1k I would consider it.

by codingrightnow4 hours ago

The big AI companies already have all of the cheap hardware to run on. It's more like if Denny's bought up all of the eggs when prices were normal, held onto them, then kept buying to keep egg prices up.

by gruez4 hours ago

>The big AI companies already have all of the cheap hardware to run on.

Have they? Aren't they doing a massive datacenter build-out right now? Moreover the massive profits for Micron and Nvidia must be coming from somewhere, and I doubt it's price-sensitive consumers.

by mkj13 hours ago

OpenAI etc. are going to have higher utilisation of the hardware, so they can afford it more than small companies/people. Efficient resource use matters more when resources are expensive.

by jcastro16 hours ago

This is amazing!

I'm working on a three node strix halo agentic OS factory designed to be maintained by local agents: https://github.com/projectbluefin/testing-lab

This memory bandwidth combo is amazing for homelabbers. kyuz0's work on these containers has made the investment in this kit so valuable. I hope Framework is sending you hardware!

https://projectbluefin.io/server/ is what I'm hoping to ship, designed to just ship setups like this ootb and things like this would be so much harder without kyuz0!

(Note: The 64GB ones are going for $1700-ish empty; the prices on the 128s are outrageous. We can just keep making the labs more deterministic over time!)

by mestadler13 hours ago

Yep, nice write-up, seems we are all doing this. It's as close as you can get to provider level for essentially prosumer hardware. I'll share what I've got with this running under k0s and the NPU work.

by sdlkj-12 hours ago

This is amazing work - RDMA on these smaller unified memory boxes (somewhat) bridges the gap for consumers from the ~24GB 3090/4090/7900 cards that are around to 128GB/256GB! Still not cheap, especially now, but... obtainable?

I do hope that Apple opens up RDMA for their TB4 machines... ds4 using TB5 Macs works great - but there are a lot of capable TB4 (M2/M1) machines out there and afaik there's no hardware limitation preventing RDMA from working (at lower bandwidth, but with the latency gains!) on the older stuff.

by kamranjon11 hours ago

Benchmarks are here: https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/

Would love to see DeepSeek V4 Flash/Pro and MiniMax M3 benchmarks, but these are already pretty impressive. First Strix Halo setup I've seen with some serious performance.

EDIT: Apologies - I think I misunderstood these benchmarks - it seems this is actually very slow when compared to an M4 or M5 chip with a good amount of memory. Looking at the creator's video here: https://youtu.be/Cfl3TS7ME5s?t=734 -- it seems the performance of Strix Halo is much, much slower than what I get on my M4 MBP, which gets ~400 tok/s prefill and ~20 tok/s generation.

by 3abiton7 hours ago

They are heavily bogged down by bandwidth, unfortunately. The Macs are on another level. If Apple decides to release AI-dedicated hardware, it would dominate this space (consumer AI).

by Tepix10 hours ago

The pp (prompt processing) speeds are really slow (50); I think there’s still room for improvement.

by kamranjon10 hours ago

Ah yeah, after watching one of the creator's YouTube videos I realize these benchmarks are combining prefill and decode, which isn't super helpful. It seems this struggles with the exact same bottleneck as all Strix Halo setups: memory bandwidth. It still seems significantly slower than equivalent memory sizing on Mac hardware.

by fulafel8 hours ago

How do the memory bandwidth specs of MacBooks compare with this?

by fulafel1 hours ago

I looked it up: 512 GB/s for the two node AMD cluster, Macbook Pro with M5 CPU has 153 GB/s. But you can get faster Macs with M5 Pro or M5 Max.

by Lwerewolf7 hours ago

The apple silicon chips basically beat everything in bandwidth. Highest amount of memory controllers (i.e. channels) for a given capacity. That's the main party trick.

by MayeulC5 hours ago

Hmm, going PCIe -> NIC -> NIC -> PCIe seems a bit silly; couldn't both devices communicate directly over PCIe?

by wmf2 hours ago

You need Non-Transparent Bridging (NTB) for that; I don't know if AMD has it.

by Tepix12 hours ago

What’s the advantage of using ConnectX-5 Ex VPI NICs instead of much cheaper ConnectX-3 VPI NICs to connect two machines directly, other than PCIe 4.0 instead of PCIe 3.0? Can they offload more tasks when doing RDMA? Solid information is hard to come by.

by justincormack7 hours ago

The machine only has PCIe 4.0 x4, so about 50Gb of bandwidth; PCIe 3.0 would halve that to 25Gb.

That's the problem with these AMD laptop-class cores: they have very little IO. They have been saying they will release in a desktop form factor, but then it probably won't have such good memory bandwidth...

The Nvidia boxes have 200Gb Ethernet, which is much more useful for clustering.

by olavgg7 hours ago

Yes CX5 can offload more. I believe CX4 has similar offloading capabilities as CX3, except that it supports 100G.

Another note: In my experience, RoCE works much better on the CX4+ generation. CX3 is best with InfiniBand. I think some firmware on the CX3 generation has a messed-up config for RoCE. But running InfiniBand is not a complex task; it's way easier than people think, like 10x easier and faster to set up than Ethernet.

by mestadler13 hours ago

This is exactly the type of technical depth that makes a difference. I've been following all the work you have been doing.

by jmyeet13 hours ago

So this is kind of fascinating. The main hardware costs here seem to be:

- 2x Framework Desktop AI Mainboards with 128GB of RAM for $3150 each

- 2x 100G Ethernet controllers for ~$500 each

So the Framework board has a single PCI-e 4.0 x4 slot, which amounts to 8GB/s or 64Gbps theoretical so you're not getting 100G. Also, the 100G cards all seem to be PCI-e x16 slots for obvious reasons so you need a riser or an adapter or something to even get them to work.
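As a rough check on that figure, here is a minimal Python sketch that derives the link rate from the per-lane transfer rate and the 128b/130b line encoding used by PCIe 3.0 and 4.0. It ignores packet and protocol overhead, so real RDMA throughput will land somewhat lower. The function name is just an illustrative choice.

```python
# Post-encoding PCIe link rate per direction (packet/protocol overhead not included).
GT_PER_S = {3: 8.0, 4: 16.0}  # per-lane transfer rate in GT/s for PCIe 3.0 and 4.0
ENCODING = 128 / 130          # both generations use 128b/130b line encoding


def pcie_gbit_per_s(gen: int, lanes: int) -> float:
    """Return the post-encoding link rate in Gbit/s for one direction."""
    return GT_PER_S[gen] * ENCODING * lanes


for gen in (3, 4):
    gbit = pcie_gbit_per_s(gen, 4)
    print(f"PCIe {gen}.0 x4: {gbit:.1f} Gbit/s ({gbit / 8:.2f} GB/s)")
```

This gives roughly 63 Gbit/s for PCIe 4.0 x4 and half of that for PCIe 3.0 x4, so a 100G NIC in this slot is limited by the slot rather than by the port.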

I don't know how hot a 100GbE copper NIC runs but, from experience, 10GbE NICs have basically been giant heatsinks. So fiber might be advisable, and I expect short fiber cables here probably aren't cost-prohibitive given everything else.

As an aside, if you are using Ethernet for clustering and you're clustering 2 devices, in an ideal world you'd be using simplex Ethernet but that's not an option here.

I wonder if the author considered USB 4.0 for clustering? I ask because I know people who have clustered Mac Studios over TB5 and that bandwidth is up to 120Gbps. The version of USB4 on the Ryzen AI 395 seems to be 40Gbps, which isn't that far off 8GB/s over PCI-e 4.0 x4.

But the limiting factor with Strix Halo (and DGX Spark for that matter) is memory bandwidth, both under 300GB/s. The obvious comparison is to the Mac Studio. Unfortunately the largest spec they currently sell is 96GB. It had been as high as 512GB. And 96GB is $6700+, but you're also getting way better performance AFAICT, e.g. [1]. The M3 Ultra has ~900GB/s memory bandwidth.
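To see why memory bandwidth dominates, a common back-of-the-envelope bound treats decoding as reading every active weight once per generated token. The sketch below applies that bound to the two bandwidth figures quoted above. The 30B active-parameter count and the 4-bit weights are hypothetical values picked for illustration, not numbers from this thread, and real throughput will be lower once KV cache reads and compute are included.

```python
def decode_tokens_per_s(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8  # billions of params to GB
    return bandwidth_gb_s / gb_read_per_token


ACTIVE_PARAMS_B = 30  # hypothetical: 30B active parameters per token (assumption)
BITS_PER_WEIGHT = 4   # assumed 4-bit quantization

for name, bandwidth in [("~300 GB/s box", 300), ("~900 GB/s M3 Ultra", 900)]:
    bound = decode_tokens_per_s(bandwidth, ACTIVE_PARAMS_B, BITS_PER_WEIGHT)
    print(f"{name}: at most ~{bound:.0f} tok/s")
```

Under these assumptions the bound scales linearly with bandwidth, which is why a 3x bandwidth gap shows up as roughly a 3x gap in generation speed.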

You can alternatively buy a Macbook Pro with M5 Max and 128GB of RAM (now $8000, was $5500-6000 a few days ago) but that tops out at ~600GB/s, which is still double these mini AI boxes.

Oh and if you don't want to go the way of these Framework motherboards, you can buy a whole 128GB Strix Halo PC for $3k or less.

I think the main point here though is we're only a few years away from running 300B+ (or even 1T+) param models at useful speeds on enthusiast hardware.

[1]: https://www.reddit.com/r/LocalLLaMA/comments/1u5mfaq/you_can...

by kcb13 hours ago

No reason to use fiber on short runs like that. DAC cables are cheap and better in pretty much every way over short distances. You're probably thinking of RJ-45 NICs and SFP modules which are known to run pretty hot.

by layla5alive12 hours ago

+1 fiber over short distance just adds power/heat and latency compared to DAC - fiber is nice for ease of cabling and airflow, but not performance or cost when below a few meters.

by PeterStuer9 hours ago

The 512GB Mac Studio is going for around 30K used.

by MisterKent12 hours ago

I ran MS-01s with 100GbE, copper DACs in my Kubernetes cluster. It killed the NVMe drives in that tiny box. I'd bet on the same issue doing this with FW. And I wasn't even pushing 100GbE very hard at all; it was mostly for fun.

AI + 100GbE (under load) + tiny box = unreliable and dead very quickly.

by angled11 hours ago

How many MS-01s did you have clustered?

And could you not use something like an N5 + iSCSI for storage?

by mestadler13 hours ago

He did cover TB/USB4 ;)

by gregoryl13 hours ago

Indeed, here: https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/...

by erik10 hours ago

Looks like there is currently no RDMA support for Thunderbolt, so it's a much higher latency connection. Apple has RDMA over Thunderbolt working, so I wonder if it's possible on Strix Halo.

by Hikikomori10 hours ago

What is simplex ethernet?

by jmyeet8 hours ago

Imagine two computers A and B. A has two NICs, A1 and A2. B has B1 and B2. So 4 NICs total. You connect directly A1 to B1 and A2 to B2 with crossover cables. You then route all the traffic from A to B over A1 to B1 and all the traffic from B to A over B2 to A2.

Why do you do all this? To avoid collisions and the loss of effective bandwidth from back-offs.

It only really works with 2 computers because if you add a 3rd, now you need 12 NICs instead of 4 for unidirectional point-to-point connections.
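Taking the scheme exactly as described (one dedicated one-way link per ordered pair of machines, with a NIC on each end), the NIC count can be sketched in a few lines of Python. The function name is hypothetical, and this only counts hardware; it says nothing about whether the scheme helps.

```python
def simplex_nics(nodes: int) -> int:
    """Total NICs needed for one-way point-to-point links between every pair of nodes."""
    directed_links = nodes * (nodes - 1)  # one link per ordered pair (A to B and B to A)
    return 2 * directed_links             # each link has a NIC on both ends


for n in (2, 3, 4):
    print(f"{n} nodes: {simplex_nics(n)} NICs")
```

The count grows quadratically, 4 NICs for two machines and 12 for three, which matches the figures above.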

by Tuna-Fish4 hours ago

That's not how modern Ethernet works at all. A single NIC talking directly to another one never has collisions. Depending on what your channel is, either you have separate wires for the two directions, or you are using a hybrid circuit (as in telegraphs; the term is so overloaded it's hard to google). Either way, packets going in one direction never wait for packets going in the other.

by Hikikomori7 hours ago

But why would you? You don't have collisions since the introduction of full duplex ethernet on both copper/fiber. Kinda sounds like you're confusing half duplex with simplex, or maybe bidi? As a network engineer I've never seen someone ever refer to "simplex ethernet".
