A single RTX 3090 is absolutely not overkill for gaming, however. Nowadays it's barely enough to get 60 FPS at 4K in some recently released games. But the shocking part is that my 3090 is still probably worth as much as when I bought it about 4 years ago.

by overgard1 hours ago|

Having a second card doesn't really work well for gaming.

by googletron1 hours ago|

what?

by kakacik1 hours ago|

AFAIK Nvidia cards don't work in tandem (what used to be SLI) very well these days. So that ain't true.

Also, 2 gens old means bad performance at ray tracing, and abysmal path tracing if it works at all. Pretty sure it can't run CP2077 smoothly at native 4K with everything on ultra and no DLSS upscaling.

by himata41131 hours ago|

You can have the 2nd card as an offload for upscaling, frame generation and whatnot.

by irishcoffee1 hours ago|

When I'm not running models, I use the 2nd one in a passthrough configuration to a Windows VM for various things, usually gaming.

by horsawlarway2 hours ago|

Yes, today is not a great time to purchase hardware.

When I bought, I paid $850 apiece. And I needed one anyway for the gaming I was going to do.

My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.

---

I'll add to this: I personally don't like Apple hardware (not so much because of the hardware as their company philosophy), but their machines with unified memory (or AMD's latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entry point to local LLMs.

There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.

You can get a 48GB unified RAM M4 Pro Mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year's worth of Claude now, get ~150 tok/s for the next decade (plus) for ~free.

If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.

You'll spend less on power too.

The last thing I'd suggest at this point is buying an RTX 3090. You can do a lot better for a lot cheaper.

by tracker126 minutes ago|

If you're willing to go the AMD route, the AMD Radeon Pro R9700 definitely looks interesting for the price compared to NVidia.

by jmuguy1 hours ago|

Or a really excellent experience playing Satisfactory with the settings cranked up, which is priceless.

by tripleee1 hours ago|

Christ, GPU prices have gotten crazy.

How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM

by overgard1 hours ago|

In my personal experience, I wouldn't bother with 16GB cards for coding -- the useful models are _slightly_ too large to work at any reasonable speed

by lambda1 hours ago|

That should do pretty well. Memory bandwidth is the biggest bottleneck for token generation, and at 644 GB/s a 9070 should hold up fine there. Prompt processing is more compute-bound, and Nvidia tends to have the edge on that side.

16 GiB won't fit much, so you'd probably want at least 2x, and preferably 3x, of those, and then you need a motherboard, power, etc. that can handle that.
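To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes decoding is purely bandwidth-bound and that every active weight is read once per generated token. The 644 GB/s figure comes from the comment above; the 14B model size and the 0.5 bytes per parameter (roughly 4-bit weights) are illustrative assumptions, not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          active_params_billion: float,
                          bytes_per_param: float) -> float:
    """Rough ceiling on decode speed when weight reads dominate.

    Ignores KV-cache reads, compute, and kernel overhead, so real
    numbers will come in lower than this.
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# 644 GB/s is quoted in the thread; the 14B dense model at ~4 bits
# per weight is a hypothetical example.
ceiling = max_tokens_per_second(644, 14, 0.5)
print(f"ceiling: {ceiling:.0f} tok/s")
```

Treat the result as an upper bound for comparing cards, not as a prediction of what a given runtime will deliver.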

by tracker124 minutes ago|

You can get an R9700 with 32GB VRAM for ~$1200-1400 depending on where you live, which is probably a better option for AI use than 2x 9070 (XT).

by nyrikki2 hours ago|

You can get 60 tps with three 1080 Tis and the sparse model, and I bet two 16GB 5060 Tis would do the same for ~1200. One 3090 is enough for a useful system, even on an old AM4 host.

by flowerthoughts1 hours ago|

In 3.6 years, chances are they are still worth $3k. Unless some new chip fab pops up that can spam the chip market. Even if the AI bubble bursts, I doubt we'll see high-RAM GPUs sell off.

by sieabahlpark2 hours ago|

[dead]

by kpw942 hours ago|

> gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models

Since you're running quantized (at UD-Q4_K_XL), check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF)!

- https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")

- https://blog.google/innovation-and-ai/technology/developers-...

by me_bx59 minutes ago|

TIL:

> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
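As a rough illustration of the memory side of that quote, the sketch below compares weight storage at 16 bits and at 4 bits per weight. The 26B parameter count is read off the model name in the thread; the flat 4 bits per weight is a simplifying assumption, since real quantized formats typically store scales and keep some tensors at higher precision, so actual files are somewhat larger.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


bf16_gib = weight_memory_gib(26, 16)  # bfloat16 baseline
q4_gib = weight_memory_gib(26, 4)     # idealized 4-bit weights (assumption)
print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {q4_gib:.1f} GiB")
```

This only counts weights; context (KV cache) and runtime buffers come on top.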

by twothreeone2 hours ago|

> unsloth/Qwen3.6-35B-A3B-MTP-GGUF

I've actually tried this exact same model locally as well, albeit on just a single 3090 at 128k context, and I got around 40-60 tok/s with Q4_K quantization.

The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.

It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue, so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months' time.

Oh, and I used "aider" to replace the Claude CLI, which may also be sub-optimal. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.

by horsawlarway1 hours ago|

I don't generally switch to implementing things myself when using this model, although there are definitely times where I stop it and correct it mid-task.

It's prone to thinking longer and more repetitively. Again, it's definitely not Opus 4.7/4.8.

I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).

I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".

I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like Opus at work, though. I have opinions about lots of things, and reinforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll", if you will.

I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.

Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.

by unethical_ban1 hours ago|

I'm so out of the loop on this stuff, it's the first time in my IT career I feel really behind on things.

I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.

I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
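One way the "architect first, implement one function at a time" idea could look in practice is sketched below. Everything here is hypothetical: the stub functions, their names, and the prompt wording are made up for illustration, and the sketch only builds the prompt text rather than calling any particular model or harness.

```python
import inspect


# Human-written stubs: signature and docstring define the contract.
def fetch_orders(api_url: str, since: str) -> list:
    """Return orders created after `since` (ISO date) from the vendor API."""
    raise NotImplementedError


def summarize(orders: list) -> dict:
    """Return a dict with the order count and the total amount."""
    raise NotImplementedError


def build_prompt(func) -> str:
    """Turn one stub into a narrowly scoped implementation request."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func)
    return (
        "Implement only this function. Do not change its signature.\n\n"
        f"def {func.__name__}{signature}:\n"
        f'    """{doc}"""\n'
    )


# Hand the model one small, well-specified task at a time.
for stub in (fetch_orders, summarize):
    print(build_prompt(stub))
```

Keeping each request to a single function with a fixed signature is meant to limit how far a smaller model can drift from the intended design.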

by anhtqweb28 minutes ago|

Grocery list management and meal planning sounds interesting. Would you mind sharing a little bit more on your use case please?

by gonzalohm2 hours ago|

Did you double the tokens per second by adding a second GPU or was the increase significantly less?

by horsawlarway2 hours ago|

No real change in inference speed. It basically just allows me to slot in more context or a bigger model.

A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.

Sometimes that matters, a lot of times it doesn't.
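For a sense of why long context eats VRAM, here is a minimal sketch of the usual KV-cache size estimate for a standard attention stack. The layer count, KV head count, head dimension, and 1 byte per element (an 8-bit cache) are hypothetical values chosen for illustration, not the configuration of any model named in this thread. Models that use sliding-window or other non-standard attention layers may need considerably less.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: float) -> float:
    """Approximate KV-cache size in GiB for full attention in every layer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


# Hypothetical model shape; only the 300k context comes from the thread.
size = kv_cache_gib(layers=48, kv_heads=4, head_dim=128,
                    context_tokens=300_000, bytes_per_element=1)
print(f"KV cache at 300k tokens: {size:.1f} GiB")
```

The cache grows linearly with context length, which is why a second card can help with fitting context even when it does nothing for speed.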

On the speed front, MoE models are great. The biggest perf difference in modern models is the move to MoE architectures.

I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MoE model (both at Q4 quant), but the MoE version runs at ~3 times the speed (150 tok/s vs 46 tok/s).
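A quick arithmetic check of those figures, with one assumption labeled: the "A4B" suffix in the MoE model's name is read here as roughly 4B active parameters per token. If decoding were limited only by reading active weights, the ratio of weights read per token would set the ideal speedup.

```python
# Throughput figures quoted in the comment above.
moe_tok_s = 150
dense_tok_s = 46
observed_speedup = moe_tok_s / dense_tok_s

# Assumption: "A4B" means about 4B active parameters per token,
# versus all 31B parameters for the dense model.
dense_params_b = 31
moe_active_params_b = 4
weight_read_ratio = dense_params_b / moe_active_params_b

print(f"observed speedup:  {observed_speedup:.2f}x")
print(f"weight-read ratio: {weight_read_ratio:.2f}x")
```

The observed speedup is well below the weight-read ratio, which would be consistent with other per-token costs (attention over the context, KV-cache reads, expert routing) not shrinking along with the active parameter count. That interpretation is a guess, not something measured here.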

by mirekrusin2 hours ago|

You're adding an extra GPU for more VRAM, not speed.

by agup7922 hours ago|

That sounds amazing. If I had some GPUs sitting around, I would totally do it. Sounds expensive to do it otherwise though.