by oceanplexian1 days ago |

by chorizo23 hours ago|

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

by nsbk22 hours ago|

I beg to differ. Have a look at this repo with optimized single- and dual-3090 configs for Qwen and Gemma models: https://github.com/noonghunna/club-3090

by sanderjd23 hours ago|

Yeah, it seems to me that Mac Studios with their unified memory architecture are genuinely good bang for the buck at the moment because of this memory-size consideration?

by SkitterKherpi23 hours ago|

You can run 8-bit 27B models in 24GB; it's definitely enough for the model size.

by SwellJoe22 hours ago|

The 8-bit quantized 27B Qwen 3.6 is 29GB. You absolutely cannot run that entirely on a 24GB GPU.

You could run a 4-bit quant, which is 16-17GB. But you'd need a smallish context, or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.

32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).

You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.
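
The sizes quoted above follow from simple arithmetic: parameter count times bits per weight. Here is a minimal sketch, assuming exactly 27e9 parameters and effective rates of roughly 8.5 and 4.85 bits per weight for 8-bit and 4-bit class quants (both approximations; real files vary with the quant recipe). It counts weights only, so the KV cache and runtime overhead come on top.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def weights_fit(n_params: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit; leaves no room for KV cache or overhead."""
    return weight_size_gb(n_params, bits_per_weight) <= vram_gb


N_PARAMS = 27e9  # assumption: the "27B" in the model name taken literally

# Assumed effective bits per weight, including quantization scales.
QUANTS = {"16-bit": 16.0, "8-bit class": 8.5, "4-bit class": 4.85}

for name, bpw in QUANTS.items():
    size = weight_size_gb(N_PARAMS, bpw)
    fits = weights_fit(N_PARAMS, bpw, vram_gb=24)
    print(f"{name}: {size:.1f} GB of weights, fits in 24 GB: {fits}")
```

Under these assumptions the 8-bit class weights alone exceed 24 GB, while the 4-bit class weights leave a few GB for context.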

by cpburns200922 hours ago|

A 32GB card does run it nicely. I use unsloth's UD-Q5_K_XL at 256k context (K/V cache at q8_0), and get ~67 t/s on a 5090. I still need to look into MTP.

by adornKey11 hours ago|

Nice. I used Q4_K_M to have some headroom. But yours seems to fit nicely.

by bityard23 hours ago|

Quantization is a trade-off, though. The quality, while still perhaps good enough for many tasks, is not as good as the full 16-bit weights that the model was designed for/released with.

by barbacoa21 hours ago|

I'm running Qwen 3.6 27B at 8-bit quantization with 262k context. It takes 53GB of VRAM on my system.
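
Most of what a long context costs beyond the weights is the KV cache, which grows linearly with context length. Below is a rough sketch of the standard estimate for a transformer that uses full attention in every layer. The layer count, KV head count, and head dimension are hypothetical placeholders, not values taken from any Qwen model card, so the output shows the scaling rather than a real measurement. Models that mix in other layer types would need less.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_tokens: int, bytes_per_element: float) -> float:
    """Approximate KV cache size in decimal gigabytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return elements * bytes_per_element / 1e9


# Hypothetical architecture, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128

for n_tokens in (32_768, 131_072, 262_144):
    fp16 = kv_cache_gb(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_tokens, 2.0)
    int8 = kv_cache_gb(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_tokens, 1.0)
    print(f"{n_tokens:>7} tokens: {fp16:.1f} GB at 16-bit, {int8:.1f} GB at ~8-bit")
```

Halving the bytes per element halves the cache, which is why quantizing K/V is the usual first step when a long context does not fit.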

by jnovek23 hours ago|

I think that’s only true for MoE models. A dense model like Qwen 3.6 27B will require more (plus a KV cache).

by bityard23 hours ago|

No, even MoE models need to fit into (V)RAM. MoE has faster inference because only a subset of the experts is used to predict the next token, but the set of experts used changes with every token.
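
The distinction can be put in numbers: total parameters set how much memory must be resident, while active parameters set how much is read per token. This is a small sketch with a made-up MoE (30e9 total, 3e9 active parameters, both hypothetical) at an assumed 4.85 bits per weight.

```python
def moe_footprint(total_params: float, active_params: float,
                  bits_per_weight: float) -> tuple[float, float]:
    """Return (resident_gb, read_per_token_gb) for the weights alone."""
    to_gb = bits_per_weight / 8 / 1e9
    return total_params * to_gb, active_params * to_gb


# Hypothetical MoE, for illustration only.
resident_gb, per_token_gb = moe_footprint(30e9, 3e9, 4.85)
print(f"must fit in memory: {resident_gb:.1f} GB")
print(f"read per token:     {per_token_gb:.1f} GB")
```

Because the router may pick different experts for each token, all of the resident weights have to be reachable quickly, even though only a small part is touched per token.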

by angoragoats19 hours ago|

So buy two.

by ThunderSizzle9 hours ago|

The cheapest 3090s I could find with any sort of guarantee were pushing $1500.

An AMD AI Pro R9700 32GB brand new is $1350 right now.

After some tweaking, I had it running models faster than the 3090 could, and it can obviously handle higher context limits and bigger models thanks to the extra VRAM.

by iagooar23 hours ago|

My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed.

But man, I have never purchased a computer which is more expensive than a decent family car.

by d0gsg0w00f17 hours ago|

I had this dream too. My 2xDGX Sparks arrive in my reality on Monday.

by jnovek23 hours ago|

An M1 Ultra has 800 GB/s unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).

by murderfs21 hours ago|

A 5090 gets you 32GB with 1.8 TB/s of memory bandwidth for ~$4k, RTX A6000 gets you 48GB at 768 GB/s for ~$3.5k, 2x 3090 gets you 48GB for $2000 or so, and if you're willing to go into the wilderness, there are much cheaper options like the AMD MI50.
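
A common rule of thumb ties these bandwidth figures to decode speed: generating each token of a dense model reads every weight once, so tokens per second cannot exceed bandwidth divided by the weight size. The sketch below uses the bandwidth numbers quoted in this thread and assumes a 17 GB 4-bit class model. It ignores KV cache reads, compute limits, and prompt processing, so treat the result as a ceiling, not a prediction.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, read_per_token_gb: float) -> float:
    """Upper bound on decode tokens/s when memory bandwidth is the only limit."""
    return bandwidth_gb_s / read_per_token_gb


WEIGHTS_GB = 17.0  # assumption: a 4-bit class 27B dense model

# Bandwidth figures as quoted in the thread, in GB/s.
DEVICES = {"RTX 5090": 1800.0, "M1 Ultra": 800.0, "RTX A6000": 768.0}

for name, bandwidth in DEVICES.items():
    ceiling = decode_ceiling_tps(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Measured speeds land below this ceiling, and a model split across system RAM is bounded by the slowest memory it has to read from.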

by jtbaker17 hours ago|

The RTX 5000 Pro 72GB seems like kind of a sleeper to me, and sips < 300W of power, approx 1/2 that of its big bro the RTX 6000. Kind of dream about installing it in a 10" rack, it seems like it might be able to work? @jeffgeerling you out there?

https://www.microcenter.com/product/709071/pny-nvidia-rtx-pr...

by angoragoats1 hours ago|

I'd also like to call out that "high bandwidth memory" (HBM) is a specifically defined thing[0], and is used in high end GPUs, and notably not used in Apple's machines.

I know you probably weren't referring to this type of memory in your post, but IMO it might be worth avoiding this term in the future unless you're referring to HBM, the standard.

[0] https://en.wikipedia.org/wiki/High_Bandwidth_Memory

by angoragoats19 hours ago|

Yeah, this is just not the case at all; a 5090 or any of the recent NVIDIA workstation cards meets these criteria.

Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.

by dheera22 hours ago|

32GB V100

by t0mpr1c315 hours ago|

Meh. I'd rather have 2x RTX 5060 Ti.