Also, 2 gens old means bad performance at ray tracing, and abysmal path tracing, if it can do it at all. Pretty sure it can't run CP2077 smoothly at native 4K with everything on ultra and no DLSS upscaling.
When I bought, I paid $850 apiece. And I needed one anyway for the gaming I was going to do.
My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.
---
I'll add to this: I personally don't like Apple hardware (less because of the hardware itself than because of the company's philosophy), but their machines with unified memory (or AMD's latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entry point to local LLMs.
There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.
You can get a 48GB unified RAM M4 Pro Mac mini for ~$2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend what a year of Claude costs now, get ~150tok/s for the next decade (plus) for ~free.
If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.
You'll spend less on power too.
The last thing I'd suggest at this point is buying an RTX 3090. You can do a lot better for a lot cheaper.
How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM
16 GiB won't fit much, so you'd probably want at least 2x, and preferably 3x, of those, and then you need a motherboard, power, etc. that can handle that.
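Rough back-of-envelope for why 16 GiB is tight, using the two Gemma sizes mentioned elsewhere in this thread. The ~4.5 bits per weight figure is my assumption for a Q4-ish quant (real GGUF files vary by quant type), and this counts weights only, with no KV cache or runtime overhead:

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: ~4.5 bits/weight for a Q4-style quant; actual files differ.
BITS = 4.5
CARD_GIB = 16

for name, params in [("26B MoE", 26.0), ("31B dense", 31.0)]:
    size = weight_gib(params, BITS)
    verdict = "fits" if size < CARD_GIB else "does not fit"
    print(f"{name}: ~{size:.1f} GiB of weights, {verdict} in {CARD_GIB} GiB")
```

Even where the weights squeeze in, there is very little left over for context, which is why a single 16 GiB card is a hard sell.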
Since you're running quantized (at UD-Q4_K_XL), check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF)!
- https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")
- https://blog.google/innovation-and-ai/technology/developers-...
> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
I've actually tried this exact same model locally as well, albeit on just a single 3090 at 128k context, and I got around 40-60tok/s with Q4_K quantization.
The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.
It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue, so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months' time.
Oh and I used "aider" to replace claude CLI, which may also be sub-optimal; I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.
It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.
I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).
I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".
I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and reinforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.
I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.
Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.
I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.
I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
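To make the idea concrete, here's a minimal sketch of what "you architect, the model fills in one function at a time" could look like. Everything here is hypothetical and just for illustration (the stub names, the invoice example, the `build_prompt` helper); it is not any particular tool's workflow:

```python
import inspect


# Human-written stubs: signatures and docstrings are the "architecture".
def parse_invoice(raw: str) -> dict:
    """Parse one 'key=value;key=value' invoice line into a dict."""
    raise NotImplementedError


def total_due(invoice: dict) -> float:
    """Return amount minus paid, never below zero."""
    raise NotImplementedError


def build_prompt(func) -> str:
    """Build a narrowly scoped request covering exactly one stub."""
    sig = inspect.signature(func)
    doc = inspect.getdoc(func)
    return (
        "Implement only this function. Do not change its signature.\n\n"
        f"def {func.__name__}{sig}:\n"
        f'    """{doc}"""\n'
    )


# One small request per stub, instead of one big "build the whole thing" prompt.
for stub in (parse_invoice, total_due):
    print(build_prompt(stub))
```

Each prompt carries one signature and one docstring, so the model never has to make design decisions, only fill in a body you can review in a minute.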
A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.
Sometimes that matters, a lot of times it doesn't.
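For a feel of why long context eats VRAM, here's a naive KV cache estimate. The layer/head numbers below are made up for illustration and are NOT the real Gemma config; models that use sliding-window attention or a quantized KV cache will come in well under this:

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Naive KV cache size: one key and one value vector per layer per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 2**30


# Hypothetical architecture (assumption, not taken from any model card),
# with 16-bit cache values.
ARCH = dict(layers=32, kv_heads=4, head_dim=128)

for ctx in (32_000, 128_000, 300_000):
    print(f"{ctx:>7} tokens: ~{kv_cache_gib(ctx, **ARCH):.1f} GiB of KV cache")
```

With numbers in that ballpark, the cache at 300k tokens is on the same order as the weights themselves, so weights plus full context quickly outgrow a single 24 GiB card.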
On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.
I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MOE model (both at Q4 quant), but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
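Quick sanity check on those numbers. Token generation is mostly limited by how many weights get read per token, so a crude ceiling on the MOE speedup is total dense params over active MOE params. The ~4B active figure is my reading of the "A4B" in the model name, so treat it as an assumption:

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two measured generation speeds."""
    return fast_tps / slow_tps


def naive_ceiling(dense_params_b: float, active_params_b: float) -> float:
    """Crude upper bound if speed scaled purely with weights read per token."""
    return dense_params_b / active_params_b


observed = speedup(150.0, 46.0)    # tok/s figures from the comment above
ceiling = naive_ceiling(31.0, 4.0)  # 4.0 is an assumption based on "A4B"

print(f"observed speedup: {observed:.2f}x")
print(f"naive ceiling:    {ceiling:.2f}x")
```

The observed gain lands well under the naive ceiling, which is what you'd expect: attention, KV cache reads, and expert routing don't shrink just because fewer expert weights are active.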