NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at an $80-90k figure with today's prices, maybe even less. That buys you 6 RTX PRO 6000 Blackwells, a decent CPU and motherboard, and a power supply. 576GB of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
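To get a feel for what those two rates mean in practice, here is a rough back-of-the-envelope sketch of single-request latency. The function name and the example prompt/answer sizes are my own assumptions for illustration, and it ignores batching, prompt caching and speculative decoding, so treat it as an estimate only.

```python
def request_latency_seconds(prompt_tokens, output_tokens,
                            prefill_tok_s=1200.0, decode_tok_s=40.0):
    """Rough latency for one request: time to ingest the prompt plus time to generate."""
    prefill_time = prompt_tokens / prefill_tok_s   # prompt is processed at the prefill rate
    decode_time = output_tokens / decode_tok_s     # answer is generated at the decode rate
    return prefill_time + decode_time

# Hypothetical request: a 12,000-token prompt and a 400-token answer.
print(request_latency_seconds(12_000, 400))
```

With those assumed sizes, prefill and decode each take about the same time, so long prompts cost as much waiting as the answer itself.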
Official price 85k...
In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.
In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.
The problem is the backplane. I have not managed to find a single baseboard, and getting a random baseboard to work with random modules is probably a crapshoot.
I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
Most of the money and energy went to mobile for the last fifteen years.
Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more IOPS to read/write it, more storage to keep it, and more memory and CPU to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.
Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.
Here we are two and a half decades into the Internet era, and my damn Bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works and doesn't get continuously slower, forcing me to buy bigger hardware, or, more draconian still, lock me out of using it how I want.
No, we're running into the limits of Moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.
So we hit limits on clock speed in the early 2000s (e.g., the 4GHz wall), but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.
Clock speed mattered, but only relative to how many watts it took to get it (and above 4GHz... too many watts).
But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.
My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.
I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.
In the same way that I didn't care much about how many watts my Pentium 4 was using in 2005.
But now... now I care a lot about memory bandwidth. I care about memory speeds and total system RAM in a manner I really, really didn't before.
So I think we're going to see a big shift to machines built on unified RAM with a crazy focus on squeezing memory bandwidth and total RAM capacity as far as we can.
My bet is that we'll get a similar 10-15x improvement by 2040 in unified system RAM designs.
I fully expect to see 2TB unified RAM desktops and 200GB unified RAM phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (e.g., World War 3 would throw a wrench into things).
In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.
The page advertises the 8-bit quant as taking ~800GB, which seems like it would require at least 3 consumer motherboards fully stacked w/ 4x64GB cards each.
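The board count is easy to sanity-check. This is a minimal sketch, assuming 4 slots of 64GB per consumer board and counting only the model weights (no KV cache, OS or runtime overhead); the function name is made up for illustration. Note that three such boards hold 768GB, so if "~800GB" is taken literally it rounds up to a fourth board.

```python
import math

def boards_needed(model_gb, dimms_per_board=4, dimm_gb=64):
    """Number of boards whose combined RAM can hold model_gb of weights."""
    per_board_gb = dimms_per_board * dimm_gb   # 4 x 64GB = 256GB per board
    return math.ceil(model_gb / per_board_gb)

print(boards_needed(800))   # ~800GB 8-bit quant
print(boards_needed(768))   # what three full boards can actually hold
```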
Maybe “locally” has slowly come to imply “…on your homelab”?
I was lucky to buy a lot of RAM before prices skyrocketed. I knew I wanted to play with this stuff, so I spent what felt like a lot of money at the time to buy 8x96GB DDR5-6400 RDIMMs. Now the same RAM costs at least 6x more.
It wasn’t that absurdly expensive for a hobby. I bought 64GB DDR4 ECC sticks for between $70 and $100 on eBay before everything took off. Now everyone is in here debating if open source is 1 month or 3 months behind SOTA. The future is obviously local.
Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.
Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.
Each token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB. With 100GB/s of memory bandwidth (replace with whatever your bandwidth is), you get 5 tokens per second.
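The same arithmetic as a small sketch you can plug your own numbers into. It assumes decode is purely memory-bandwidth bound and that every active weight is read exactly once per token, so it is an upper bound rather than a benchmark; the function name is just for illustration, and the 40B active-parameter figure is the guess from above.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper-bound decode speed when each token must stream all active weights from memory."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8  # 40B at 4-bit -> 20GB
    return bandwidth_gb_s / gb_read_per_token

# ~40B active parameters, 4-bit quant, 100GB/s of memory bandwidth.
print(decode_tokens_per_second(40, 4, 100))
```

Swap in your own bandwidth figure for the last argument; real throughput will come in lower once KV cache reads and compute overhead are included.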