That 3090 is going to burn 750W, and it will still cap you at a 4-bit quant and ~48K context. Here's someone who worked through it:

https://github.com/noonghunna/qwen36-27b-single-3090

It flies, though (50-70 tok/s is impressive for a model this smart).

I went through roughly the same process to get it working on my M2 MacBook Pro... at awful speeds of course, since token generation for models like this one is mostly bound by memory bandwidth.
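To put a rough number on the memory bandwidth point, here is a back-of-the-envelope sketch. It assumes a dense 27B model at exactly 4 bits per weight, with every weight read once per generated token, and it ignores KV cache traffic, quantization overhead, and speculative decoding. The bandwidth figures (936 GB/s for the 3090, 100 GB/s for a base M2) are spec-sheet values I'm assuming, not measurements, and the base M2 is only a guess at the chip since the exact variant isn't stated. The function name is made up for illustration.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float,
                           params_billions: float,
                           bits_per_weight: float) -> float:
    """Rough ceiling on single-stream decode speed for a dense model.

    Assumes every weight is read from memory once per generated token,
    so tokens/s <= memory bandwidth / bytes of weights.
    """
    weight_gb = params_billions * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Assumed spec-sheet bandwidths in GB/s, not measured values.
    devices = [("RTX 3090", 936.0), ("Apple M2 (base)", 100.0)]
    for name, bandwidth in devices:
        tps = decode_tps_upper_bound(bandwidth, params_billions=27, bits_per_weight=4)
        print(f"{name}: at most ~{tps:.1f} tok/s")
```

Under those assumptions the ceiling works out to about 69 tok/s on the 3090 and about 7 tok/s on a base M2, which is at least consistent with the speeds mentioned above. Real numbers will land below the ceiling.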

> That 3090 is going to burn 750W

The 3090's TDP is 350W, but since LLM token generation isn't compute-bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat: these figures are without speculative decoding and at batch size = 1.
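For anyone who wants to try this, a minimal sketch of one way to do it on Linux. Note this sets a driver power limit through `nvidia-smi -pl`, which is not the same thing as a true undervolt, but it is a common way to hit a wattage target. The helper names are hypothetical, the 250W value is just an example from the range above, and the real call needs root and does not persist across reboots. It defaults to a dry run that only prints the commands.

```python
import subprocess


def power_limit_cmd(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi command that caps one GPU's power draw."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(gpu_indices, watts: int, dry_run: bool = True) -> list[list[str]]:
    """Cap each listed GPU at `watts`. With dry_run=True, only print the commands."""
    cmds = [power_limit_cmd(i, watts) for i in gpu_indices]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Needs root; raises CalledProcessError if the driver rejects the value.
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # Example only: a single card at index 0, capped at 250W, dry run.
    apply_power_limit([0], 250)
```

Whether a given wattage costs you any tokens per second depends on the card and the workload, so it's worth benchmarking before and after.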

This is correct. I have four 3090s in my inference server, each capped at 250W. I run Qwen 3.5 122B-A10 at about 45-50 tok/s on this and am quite happy with it. At idle, all four together draw around 95-105W, which is a bit high but tolerable.

My eyes glaze over reading all the AI-produced verbiage.

I did find a few useful parameter settings, ones I'd already discovered using my single 3090 and ollama.

I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.

[edited to mention ollama as a nice alt]
