V4 is natively mixed FP4 and FP8, so significantly less than that: about 50 GB max unquantized.
My Mac can fit almost a 70B model (Q3_K_M) in memory at once, so I really need to try this out soon, maybe at Q5-ish.
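To make the "Q3 fits, maybe Q5 too" arithmetic concrete, here's a rough sketch of a 70B model's footprint at common llama.cpp quant levels. The bits-per-weight figures are approximate effective values I'm assuming for GGUF K-quants, not exact numbers (real quants mix block scales and per-tensor types, and the KV cache adds on top):

```python
# Rough memory footprint of a 70B-parameter model at various quants.
# Bits-per-weight values are approximate effective GGUF figures
# (assumptions for illustration, not exact).
PARAMS = 70e9

bits_per_weight = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.8,
    "Q3_K_M":  3.9,
}

for name, bpw in bits_per_weight.items():
    gib = PARAMS * bpw / 8 / 2**30  # bytes -> GiB
    print(f"{name:7s} ~{gib:5.1f} GiB")
```

By this estimate Q3_K_M lands around 32 GiB and Q5_K_M around 46 GiB, which is why the jump from Q3 to Q5 matters on a memory-limited Mac.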
Streaming weights from RAM to the GPU for decode makes no sense at all: it only pays off if you can amortize the transfer over a large batch, and batching requires multiple parallel request streams that a single local user doesn't have.
Streaming weights from SSD _never_ makes sense because the bandwidth gap between SSD and RAM is too large. There's no realistic situation where a model can't fit in RAM but still decodes at useful speed from SSD.