undefined

points

[-]

> usefulness of the RTX Spark

Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't mention EXACTLY what is the memory bandwidth. It's going to be dog-slow unusable for large models, as tok/sec is basically bandwidth divided by active weights. Rumoured 300GB/s / 30GB active weights (decent model) = 10 tokens per second, which is really slow

by SwellJoe18 hours ago|

parent|

[-]

Yep, I have a Strix Halo and while it can run models bigger than Qwen 3.6 27b, it's not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it's not usable for coding agents.

The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. At least not enough to make it useful interactively at that scale.

by zozbot23418 hours ago|

parent|

[-]

Why does everyone expect interactivity from local AI? It's not the best use of the hardware, especially not miniPC hardware. Long-term batched inference with larger and more capable models is much more feasible AIUI.

by int_19h15 hours ago|

parent|

[-]

I can't speak for others but IMO the only reason to run models locally right now is privacy - i.e. you don't trust any of the cloud providers to not look at your prompts. Price-wise the market is extremely competitive and cheap model serving favors large scale so anything that can be run locally can be run cheaper in the cloud. But if privacy is important, then it's important for everything, including traditional chatbot applications, which kinda do require interactivity.

by SwellJoe18 hours ago|

parent|

prev|

[-]

Even batched it's uncomfortably slow. I started to benchmark ds4 with my security vulnerability benchmark (after Qwen 3.6 dense and MoE and a bunch of cloud models), but it was going to tie up the Strix Halo for more than a day, so I decided not to run it as it would prevent me from doing other stuff with it during that time.

Even batched usage needs to be fast enough to deliver results in a reasonable time. Overnight runs are useful, 24 hour runs are...less so.

Anyway, most of the time people are talking about interactive use, and there's currently an upper bound on how large a model can be for local hosting on a reasonable budget (i.e. not a crazy amount more expensive than what a high end developer desktop or laptop costs). The sweet spot is probably currently the big Qwen 3.6 or Gemma 4 models, which are in the ~60GB range for 8-bit quantization plus a large context.

by hedgehog17 hours ago|

parent|

[-]

The 6-bit versions + 8-bit KV cache seems to save a good bit of memory without a significant loss of quality. The Qwen 35B is pretty fast in my testing, but MiniMax M2.7 230B is in some ways faster (way fewer tokens to arrive at an answer) even though it is much larger.

by SwellJoe16 hours ago|

parent|

[-]

Qwen 3.6 35B-A3B with MTP at 8 bits is blazing fast, something like 50-60 tokens per second. That's plenty fast for interactive use, so I haven't tried lower bits. Unfortunately the MoE is notably dumber than the dense model (for the case I have data about...I've been benchmarking models for security vulnerability scanning, and 27B is notably better on hard bugs).

The dense model is almost usable, but feels really sluggish, even with MTP. I think it's about 12-15 tokens/second on the Strix Halo. Slow enough to where I'd rather pay to use a cloud model.

I might try the 6-bit version of the dense model to see how it behaves, though. Maybe it'll retain its bug hunting abilities while making it fast enough to use interactively and not take all day for benchmark runs.

by hedgehog15 hours ago|

parent|

[-]

Same chip, with a 6 bit 35B and 8 bit KV cache I see about 500 prefill and 55 decode at 30k into the context window. MiniMax seemed a bit lower token rate but much, much less prone to 40k tokens of monologue before generating an answer. A pattern I like is to use a smaller model to do most execution and then a larger model to review transcripts and output and do any fixups and tooling improvements (this is all batch jobs so all I care about is overall throughput).

by milch13 hours ago|

parent|

prev|

[-]

What hardware do you need to run MiniMax M2.7 230B locally?

by hedgehog10 hours ago|

parent|

[-]

Ryzen 395 is what I'm using, anything with 128GB+ of RAM accessible to the GPU should work fine for a 4 bit version of the model (so Spark or Mac Studio should be ok too).

by dirkg13 hours ago|

prev|

[-]

The RTX/DGX Spark, Mac Ultras with 128GB unified ram are all ~$5k. Its still an expensive toy for rich people, it might as well be an H100 for 99.9% of the population (not devs with high paying jobs, of course).

the value of local models is allowing normal people to access AI without needing to subscribe to cloud services. this is esp imp for the rest of the world where even a 12GB gpu is extremely expensive.

there is no real viable local option that will come even close to Sonnet/Gemini Flash or the cheaper chinese models. Even if your pc costs <$2k you are never going to recoup the hw costs, and the results will be far worse.

by green7ea5 hours ago|

parent|

[-]

I'm using a Strix Halo laptop (~3k, 64GiB) and with Gemma 4 and Qwen 3.6, both at 8 bits, I'm seeing very impressive results.

As a work tool, this is reasonably priced. You can save a bit of money by opting for a non-laptop form factor.

by organsnyder4 hours ago|

parent|

prev|

[-]

My Framework Desktop with 128GB was about half that. I did luck out by buying before RAM prices went crazy, though.

I'm looking forward to the fallout when the data center bubble bursts. There's a good possibility we'll see a glut of hardware, either on the used market or from manufacturers that no longer have massive orders from OpenAI and the like.

by zozbot23418 hours ago|

prev|

[-]

RTX Spark is pretty much the DGX Spark in a laptop form factor, plus some lower-performing chips in the same series to be released later according to rumors. We know quite well how the top-of-the-line chip performs: it's very interesting for some application areas, less so for others.