
by ls612 9 hours ago


by simonw 9 hours ago

Unsloth often turn them around within a few hours, though they might have gone to bed already!

Keep an eye on https://huggingface.co/unsloth/models

Update ten minutes later: https://huggingface.co/unsloth/DeepSeek-V4-Pro just appeared but doesn't have files in yet, so they are clearly awake and pushing updates.

by mohsen1 8 hours ago

"2 minutes ago" https://huggingface.co/unsloth/DeepSeek-V4-Pro

by EnPissant 8 hours ago

Those are quants, not distills.

by inventor7777 9 hours ago

Weren't there some frameworks recently released to allow Macs to stream weights from fast SSDs and thus fit way more parameters than what would normally fit in RAM?

I haven't tried one yet, but I am considering trying that for a medium-sized model.

by simonw 9 hours ago

I've been calling that the "streaming experts" trick. The key idea is to take advantage of Mixture of Experts models, where only a subset of the weights is used for each round of calculations, and to load just those weights from SSD into RAM for each round.

As I understand it, if DeepSeek v4 Pro is 1.6T total with 49B active, that means you'd need just 49B in memory, so ~100GB at 16 bit or ~50GB at 8bit quantized.

v4 Flash is 284B total with 13B active, so it might even fit in <32GB.
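The arithmetic above can be checked with a few lines. This is a minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only, with no KV cache, activations or runtime overhead. `weight_gb` is a made-up helper name, not part of any library.

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weights-only footprint in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# Active-parameter counts as quoted in the comment above.
for name, active_b in [("v4 Pro", 49), ("v4 Flash", 13)]:
    for bits in (16, 8):
        gb = weight_gb(active_b, bits)
        print(f"{name}: {active_b}B active at {bits}-bit ~ {gb:.0f} GB")
```

That works out to 98 GB and 49 GB for Pro and 26 GB and 13 GB for Flash, which matches the rounded figures in the comment.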

by zozbot234 8 hours ago

The "active" count is not very meaningful except as a broad measure of sparsity, since the experts in MoE models are chosen per layer. Once you're streaming experts from disk, there's nothing that inherently requires having 49B parameters in memory at once. Of course, the less memory you have available for caching, the higher the performance overhead of fetching from disk.
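One way to picture the caching trade-off is an LRU cache keyed by (layer, expert). This is only a rough sketch, not how any of the projects mentioned in this thread actually work: `ExpertCache`, `fake_load` and the hard-coded routing are all hypothetical stand-ins.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache of expert weights keyed by (layer, expert_id)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max number of experts kept in RAM
        self.load_fn = load_fn        # hypothetical: reads one expert from disk
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        weights = self.load_fn(layer, expert_id)
        self.entries[key] = weights
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return weights


def fake_load(layer, expert_id):
    # Stand-in for an SSD read; a real loader would return tensors.
    return f"weights[{layer}][{expert_id}]"


# Experts are picked per layer for each token; this routing is made up.
routing = [
    [(0, 3), (1, 7)],  # token 1: (layer, expert_id) pairs
    [(0, 3), (1, 2)],  # token 2 reuses expert 3 in layer 0
]
cache = ExpertCache(capacity=2, load_fn=fake_load)
for token in routing:
    for layer, expert_id in token:
        cache.get(layer, expert_id)
print(f"hits={cache.hits} misses={cache.misses}")
```

A smaller `capacity` means less RAM used but more misses, and each miss is a trip to disk.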

by zargon 8 hours ago

> ~100GB at 16 bit or ~50GB at 8bit quantized.

V4 is natively mixed FP4 and FP8, so significantly less than that. 50 GB max unquantized.
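To put bounds on that: the exact FP4/FP8 split isn't given here, so this sketch treats the FP4 share as a free parameter (an assumption) and ignores any per-block scale factors. `mixed_weight_gb` is a made-up helper name.

```python
def mixed_weight_gb(params_billions, fp4_fraction):
    """Weights-only footprint in decimal GB for a mix of FP4 and FP8."""
    avg_bits = 4 * fp4_fraction + 8 * (1 - fp4_fraction)
    return params_billions * avg_bits / 8


active_b = 49  # active parameters (billions) quoted earlier in the thread
upper = mixed_weight_gb(active_b, 0.0)  # everything in FP8
lower = mixed_weight_gb(active_b, 1.0)  # everything in FP4
print(f"between {lower} GB and {upper} GB for the active weights")
```

Any real mix lands between those two ends, so the all-FP8 case is where the "50 GB max" comes from.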

by inventor7777 9 hours ago

Ahh, that actually makes more sense now. (As you can tell, I just skimmed through the READMEs and starred "for later".)

My Mac can fit almost 70B (Q3_K_M) in memory at once, so I really need to try this out soon at maybe Q5-ish.

by EnPissant 8 hours ago

Streaming weights from RAM to GPU for prefill makes sense due to batching, and PCIe 5.0 x16 is fast enough to make it worthwhile.

Streaming weights from RAM to GPU for decode makes no sense at all because batching requires multiple parallel streams.

Streaming weights from SSD _never_ makes sense because the delta between SSD and RAM is too large. There is no situation where you would not be able to fit a model in RAM and also have useful speeds from SSD.
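The prefill-versus-decode point comes down to how many tokens share one pass over the streamed weights. A back-of-the-envelope sketch, where the 50 GB per pass and the 2048-token prefill batch are illustrative assumptions and 64 GB/s is roughly the theoretical one-direction rate of PCIe 5.0 x16:

```python
def transfer_seconds_per_token(weight_gb, link_gb_per_s, tokens_per_pass):
    """Weight transfer time, amortized over the tokens sharing one pass."""
    return weight_gb / link_gb_per_s / tokens_per_pass


WEIGHT_GB = 50.0   # assumed amount of weights streamed per forward pass
LINK_GBPS = 64.0   # approximate theoretical PCIe 5.0 x16 bandwidth

prefill = transfer_seconds_per_token(WEIGHT_GB, LINK_GBPS, tokens_per_pass=2048)
decode = transfer_seconds_per_token(WEIGHT_GB, LINK_GBPS, tokens_per_pass=1)
print(f"prefill: {prefill * 1e3:.2f} ms/token")
print(f"single-stream decode: {decode * 1e3:.0f} ms/token")
```

With one stream, decode pays the full transfer cost on every token, which is why batching only helps it when there are multiple parallel streams.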


by simonw 7 hours ago

There have been some very interesting experiments with streaming from SSD recently: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

by zozbot234 8 hours ago

These are more like experiments than a polished release as of yet. And the reduction in throughput is high compared to having the weights in RAM at all times, since you're bottlenecked by the SSD, which even at its fastest is much slower than RAM.
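A rough way to see the size of that bottleneck is to bound decode speed by the time spent reading weights alone. All numbers below are illustrative placeholders, not measurements of any real machine, and `tokens_per_second` is a made-up helper that ignores compute entirely.

```python
def tokens_per_second(gb_per_token, hit_rate, ram_gb_per_s, ssd_gb_per_s):
    """Upper bound on decode speed when weight reads are the only cost."""
    from_ram = gb_per_token * hit_rate / ram_gb_per_s
    from_ssd = gb_per_token * (1 - hit_rate) / ssd_gb_per_s
    return 1 / (from_ram + from_ssd)


# Placeholder values chosen only to show the shape of the trade-off.
GB_PER_TOKEN = 25.0   # weights touched per generated token
RAM_GBPS = 400.0      # assumed memory bandwidth
SSD_GBPS = 6.0        # assumed SSD read bandwidth

for hit_rate in (1.0, 0.9, 0.5, 0.0):
    tps = tokens_per_second(GB_PER_TOKEN, hit_rate, RAM_GBPS, SSD_GBPS)
    print(f"cache hit rate {hit_rate:.0%}: at most {tps:.2f} tokens/s")
```

Even a modest miss rate dominates the total, because each gigabyte read from SSD costs far more time than one read from RAM.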

by the_sleaze_ 9 hours ago

Do you have the links for those? Very interested.

by inventor7777 9 hours ago

Sure!

Note: these are just two that I starred when I saw them posted here. I have not looked at them seriously yet.

https://github.com/danveloper/flash-moe

https://github.com/t8/hypura