My GB10 Spark-alike is absolutely amazingly fun… but it is not cost-effective. Step 3.7 Flash is shockingly capable (IQ4_XS, used mainly for web dev), but the machine cost me $6800 AUD, and they’re even more expensive now. The numbers just don’t make sense: even with proper triple-head MTP I top out at ~40 tok/s decode, with prefill at 1000+ tok/s.

$6800 is a lot of API credits for GLM, for example, on any provider you want to use.
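
A rough way to put numbers on that comparison, as a sketch only: the API price below is a made-up placeholder (not a real quote from any provider), and the calculation ignores electricity, prefill tokens, and resale value.

```python
def breakeven_tokens(hardware_cost: float, api_price_per_million: float) -> float:
    """Tokens you could buy from an API for the price of the hardware."""
    return hardware_cost / api_price_per_million * 1_000_000


def days_to_generate(tokens: float, tokens_per_second: float) -> float:
    """Days of nonstop local decoding needed to produce that many tokens."""
    return tokens / tokens_per_second / 86_400


# Hypothetical blended API price in AUD per million tokens; substitute a real one.
ASSUMED_API_PRICE = 3.0

tokens = breakeven_tokens(6800, ASSUMED_API_PRICE)
days = days_to_generate(tokens, 40)  # ~40 tok/s decode, the figure quoted above
print(f"{tokens / 1e9:.2f}B tokens, about {days:.0f} days of nonstop decode")
```

At any plausible API price the break-even point is a very long stretch of continuous generation, which is the point being made here.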

Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.

I'm still going to buy a second one, haha.

by c7b 22 hours ago

My 2c: you don't need the Strix Halo desktop. The chip comes in many rigs, most of them cheaper, and the performance difference isn't worth it. It used to be half the price of a DGX Spark or a Mac with 128GB RAM; if you can still find it at that price, I'd say it's the best bang for your buck. Otherwise, Macs have 2-3x the memory bandwidth of the DGX Spark, depending on the chip, so I'd prefer them unless you're planning on building a cluster. The DGX Spark has two 100Gb/s connectors, ideal for clustering. But I haven't checked what else you could get for the price of two DGX Sparks.
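
Why memory bandwidth matters so much here: single-stream decode is usually bandwidth-bound, so a back-of-the-envelope ceiling is bandwidth divided by the bytes of weights read per token. A minimal sketch, assuming the commonly listed 273 GB/s figure for the DGX Spark and a hypothetical 20 GB of active weights (both are illustrative inputs, not measurements):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed if each token streams the active weights once."""
    return bandwidth_gb_s / active_weights_gb


SPARK_BANDWIDTH_GB_S = 273.0  # assumed: commonly listed DGX Spark spec
ASSUMED_WEIGHTS_GB = 20.0     # hypothetical: quantized weights touched per token

# 1x is the Spark itself; 2x and 3x stand in for the Macs mentioned above.
for multiple in (1, 2, 3):
    ceiling = decode_ceiling_tok_s(SPARK_BANDWIDTH_GB_S * multiple, ASSUMED_WEIGHTS_GB)
    print(f"{multiple}x bandwidth: at most ~{ceiling:.1f} tok/s")
```

Real throughput lands below this ceiling (KV cache reads, compute, overhead), and MoE models only read their active experts, so treat it as a way to compare machines rather than a prediction.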

by brandensilva 17 hours ago

Thoughts on an M5 Ultra 768GB if it drops? What price would make it worth it for you over a Spark cluster?

I want to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but I'm trying to decide if a Spark cluster makes more sense.

by PeterStuer 12 hours ago

The M3 with 512GB is currently sitting at around 30K, used. You can extrapolate from there.

by brandensilva 7 hours ago

[dead]

by lee_ars 23 hours ago

I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.

But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.

Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.
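
For anyone wondering how 0.42 maps to roughly 50GB: vLLM treats `--gpu-memory-utilization` as the fraction of device memory it may claim. A quick sanity check, assuming the Spark's nominal 128 GB of unified memory (the total that vLLM actually sees may be somewhat lower):

```python
def vllm_memory_budget_gb(total_memory_gb: float, utilization: float) -> float:
    """Memory vLLM may claim for weights, activations, and KV cache."""
    return total_memory_gb * utilization


ASSUMED_TOTAL_GB = 128.0  # nominal unified memory; an assumption, not a measured value

budget = vllm_memory_budget_gb(ASSUMED_TOTAL_GB, 0.42)
print(f"budget: ~{budget:.1f} GB")
```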

by coder543 20 hours ago

Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.

I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.

As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.

Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.
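
For intuition on why those speedups differ, here is a simplified model of speculative decoding, not a measurement: assume each drafted token is accepted independently with the same probability and drafting stops at the first rejection. The acceptance rates below are hypothetical, and draft overhead is ignored.

```python
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    The main model always yields one token; drafted token i also lands
    only if all i drafted tokens before it were accepted, hence the
    geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    return sum(acceptance ** i for i in range(draft_len + 1))


# Hypothetical acceptance rates with three drafted tokens per step.
for acceptance in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(acceptance, draft_len=3)
    print(f"acceptance {acceptance:.1f}: ~{tokens:.2f} tokens per step")
```

Under this model, a draft head that guesses well for one model and poorly for another gives very different multipliers even with the same draft length.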

by cpburns2009 20 hours ago

Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.

by etdznots 4 hours ago

This is the default value!

by cpburns2009 3 hours ago

llama.cpp defaults to 1.0 (disabled), and so does vLLM. It looks like only Ollama defaults to 1.1.
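
To show why 1.0 means "disabled", here is a sketch of the usual repetition-penalty rule: logits of already-seen tokens are divided by the penalty when positive and multiplied by it when negative. Real samplers add details such as a last-N window, so this is illustrative rather than the exact llama.cpp or vLLM code. In vLLM the corresponding sampling parameter is `repetition_penalty`.

```python
def apply_repeat_penalty(
    logits: dict[int, float], seen: set[int], penalty: float
) -> dict[int, float]:
    """Return a copy of logits with previously seen tokens made less likely."""
    out = dict(logits)
    for tok in seen:
        if tok in out:
            # Both branches push the logit toward "less likely" when penalty > 1.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = {0: 2.0, 1: -1.0, 2: 0.5}  # toy vocabulary of three token ids
seen = {0, 1}                        # tokens already generated

print(apply_repeat_penalty(logits, seen, 1.1))  # tokens 0 and 1 are pushed down
print(apply_repeat_penalty(logits, seen, 1.0))  # unchanged: 1.0 is a no-op
```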

by rnxrx 22 hours ago

There are also NVFP4 quants of Qwen 3.6 27B/35B floating around. I've benchmarked both, and the quality difference vs FP8/BF16 was barely noticeable. Honestly, the NVFP4 capability is the most interesting feature of the Spark (at least for me).

by anon373839 21 hours ago

I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.

by gnerd00 19 hours ago

`llama-server` looping mitigations: set `--repeat-penalty` to something greater than 1.0, turn reasoning/thinking off explicitly, and prefer a GGUF quantized at more than 4 bits.

by pkroll 23 hours ago

Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.