Thanks. Outside of LLM circles, DS4 is usually a video game controller.
reply
Well, I was sitting here expecting the Redis creator to have an opinion on the still-unannounced Dark Souls 4.
reply
Haha, same here!!
reply
Or a car from Citroen
reply
Technically, DS is an independent sibling of Citroën within Stellantis, a sprawling conglomerate that owns a dog's dinner of car brands across Europe and the US.
reply
It's still the Lexus to Citroen's Toyota.
reply
If we want to get really technical, “DS4” is a model from Citroën and they later spun out the DS lineup into its separate brand, with the “Citroën DS4” becoming “DS 4”, “DS” being the make and “4” being the model.
reply
And even more pedantically, DS has recently adopted a new naming scheme where the former DS 4 is now written as DS N°4, pronounced "number 4"...

Their stated inspiration for this SEO bomb is Chanel perfumes.

reply
Pavlov's dog's dinner?
reply
Trekkies are experiencing a major regression from Deep Space Nine.
reply
They never should have trusted Quark
reply
I am actually kind of disappointed it wasn't a deep dive on the DualShock 4.
reply
That's the Flash version, not the full model, and only at ~Q2-3, so while impressive it's still quite different from the full model.
reply
Not really. I'm now building another fast C compiler with DeepSeek 4 Flash, and rarely have to step outside it to use Pro or Sonnet, gpt or kimi-2.6. Flash is very capable at almost everything.
reply
Thanks. How is DwarfStar4 different from llama.cpp?
reply
> The blog post implies that it currently requires 96GB of VRAM.

Has anyone tested what happens if you try to run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.

reply
It'd be way slower, since you'd be doing that work for every token.
reply
True: with 64GB of RAM it would already have to fetch 20% of its active experts from disk, about 650MB/token at 2-bit quant, and that percentage rises quickly as you lower RAM further. My question is a more practical one: does it run at all, how bad is the slowdown, and to what extent can you claw back some of that decode throughput by running multiple (slower) agent sessions in parallel under a single Dwarf Star 4 server?
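The back-of-envelope version of that slowdown: if 650MB of expert weights must be streamed per token, disk bandwidth alone caps decode speed. A minimal sketch, where the 650MB/token figure comes from the comment above and the ~5GB/s SSD read speed is my own assumption, not from the thread:

```python
# Disk-bound decode ceiling when part of each token's active experts
# spills to storage. 650 MB/token is the figure from the comment;
# 5 GB/s sequential SSD read speed is a hypothetical assumption.

def disk_bound_tokens_per_sec(spill_mb_per_token: float,
                              ssd_gb_per_sec: float) -> float:
    """Upper bound on tokens/sec if disk reads were the only cost."""
    return (ssd_gb_per_sec * 1024) / spill_mb_per_token

ceiling = disk_bound_tokens_per_sec(650, 5.0)
print(f"{ceiling:.1f} tok/s ceiling")  # ~7.9 tok/s at best, before compute
```

Real throughput would be lower still (random access patterns, compute overlap), which is why parallel sessions might help: they could keep the disk busy while other sessions compute.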
reply
I knew Death Stranding 3 wasn't out yet!
reply
>The blog post implies that it currently requires 96GB of VRAM.

From the GitHub page it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090, but it probably won't work.

reply
FYI, llama.cpp (which antirez/ds4 is inspired by) supports system RAM. E.g. [1] is a good guide for running a similar-sized model with 128 GB of RAM and a 3090-sized GPU.

[1] https://unsloth.ai/docs/models/tutorials/minimax-m27

(Unsloth's deepseek-v4 support is still WIP)
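The core idea of that split is just budgeting: put as many layers as fit into the 24 GB of a 3090 and leave the rest in system RAM. A rough planner sketch, where the layer count and per-layer size are hypothetical placeholders, not the real DeepSeek-v4 numbers:

```python
# Rough GPU+CPU layer-split planner in the llama.cpp style: fit as
# many layers as possible into VRAM, keep the rest in system RAM.
# n_layers and gb_per_layer below are illustrative, not real numbers.

def plan_split(n_layers: int, gb_per_layer: float,
               vram_gb: float, reserve_gb: float = 2.0):
    """Return (layers_on_gpu, layers_on_cpu), leaving some VRAM
    headroom for the KV cache and compute buffers."""
    usable = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(usable // gb_per_layer))
    return on_gpu, n_layers - on_gpu

gpu, cpu = plan_split(n_layers=60, gb_per_layer=1.5, vram_gb=24)
print(gpu, cpu)  # 14 layers on GPU, 46 on CPU
```

In llama.cpp itself this maps onto the `--n-gpu-layers` option; the guide in [1] walks through tuning it for this class of hardware.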

reply
Thanks, I can run Qwen 3.6 27B with vLLM, but I was curious about the antirez tool.
reply
Have you had it get stuck in endless loops on maybe ~10-20% of invocations? It seems to happen with both the Responses and Chat Completions APIs, and no matter what inference parameters I try, it happens for at least 1 in 10 requests. I've tried every compatible vLLM version and am currently using it from git (#main), yet the issue persists.

It seems to happen with various quantizations too, even the NVFP4 versions and others, so it looks like a deeper issue to me, or perhaps a hardware incompatibility.

reply
It wouldn’t be useful with your setup: probably 3-4 tokens per second.
reply
Yep, maybe I can open a feature request if it makes sense technically.
reply
Arguably it makes more sense technically to get the model support into llama.cpp, which provides many options for GPU+CPU split inference already.
reply