I own a single R9700 for the same reason you mentioned and am looking into getting a second one. It took a lot of fiddling to get working on Arch, but RDNA4 and ROCm have come a long way. Every once in a while Arch package updates break things, but that's not exclusive to ROCm.

LLMs run great on it. It's happily running gemma4 31b at the moment and I'm quite impressed. For the amount of VRAM you get it's hard to beat, apart from the Intel cards maybe. But the driver support doesn't seem to be that great there either.

I had some trouble running ComfyUI, but it's not my main use case, so I have not spent a lot of time figuring that out yet.

by canpan 2 days ago

Thanks for the answer. Brings my hope up. Looking in my local shops, I can get 3 cards for the price of one 5090.

May I ask what kind of tok/s you are getting with the R9700? I assume you got it fully in VRAM?

by jhgorrell 2 days ago

Stock install, no tuning.

  $ uname -r
  6.8.0-107-generic
  $ ollama --version
  ollama version is 0.20.2
  $ ollama run "gemma4:31b" --verbose "write fizzbuzz in python."
  [...]
  total duration:       45.141599637s
  load duration:        143.633498ms
  prompt eval count:    21 token(s)
  prompt eval duration: 48.047609ms
  prompt eval rate:     437.07 tokens/s
  eval count:           1057 token(s)
  eval duration:        44.676612241s
  eval rate:            23.66 tokens/s
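
For reference, the reported rates are just token counts divided by durations. A quick check in Python, using only the figures from the output above (the function name is made up for illustration):

```python
def tokens_per_second(token_count: int, duration_s: float) -> float:
    """Throughput as tokens divided by wall-clock seconds."""
    return token_count / duration_s

# Figures copied from the ollama --verbose output above.
prompt_rate = tokens_per_second(21, 0.048047609)
eval_rate = tokens_per_second(1057, 44.676612241)

print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate:        {eval_rate:.2f} tokens/s")
```

Note that the prompt here is only 21 tokens, so the prompt eval rate says little about prefill speed on long contexts.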

by theoli 2 days ago

I have a dual R9700 machine, with both cards in PCIe gen4 x8 slots. The 256-bit GDDR6 memory bandwidth is the main limiting factor and makes dense models above 9b fairly slow.

The model that is currently loaded full time for all workloads on this machine is Unsloth's Q3_K_M quant of Qwen 3.5 122b, which has 10b active parameters. With almost no context usage it will generate 59 tok/sec. At 10,000 input tokens it will prefill at about 1500 tok/sec and generate at 51 tok/sec. At 110,000 input tokens it will prefill at about 950 tok/sec and generate at 30 tok/sec.
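
To put those prefill rates in terms of wait time, here is a rough sketch that divides prompt length by prefill speed and adds generation time. The 500-token reply length is an assumption for illustration, not a figure from this comment; the other numbers are the ones quoted above.

```python
def response_time_s(prompt_tokens, prefill_tps, output_tokens, generate_tps):
    """Rough wall-clock estimate: (prefill seconds, generation seconds)."""
    prefill_s = prompt_tokens / prefill_tps
    generate_s = output_tokens / generate_tps
    return prefill_s, generate_s

REPLY_TOKENS = 500  # assumed reply length, for illustration only

# (prompt tokens, prefill tok/s, generate tok/s) as quoted above
for prompt_tokens, prefill_tps, generate_tps in [(10_000, 1500, 51), (110_000, 950, 30)]:
    prefill_s, generate_s = response_time_s(
        prompt_tokens, prefill_tps, REPLY_TOKENS, generate_tps
    )
    print(f"{prompt_tokens:>7} prompt tokens: "
          f"~{prefill_s:.0f} s prefill, ~{generate_s:.0f} s to generate")
```

With the quoted rates, a 110,000-token prompt spends roughly two minutes in prefill before the first output token, assuming none of the prompt is served from a cache.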

Smaller MoE models with 3b active will push 70 tok/sec at 10,000 context. Dense models like Qwen 3.5 27b and Devstral Small 2 at 24b will only generate at around 13 - 15 tok/sec with 10,000 context.
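
The bandwidth argument can be sketched as a back-of-the-envelope ceiling: each generated token has to read every active weight once, so decode speed cannot exceed memory bandwidth divided by the bytes of active weights. The 640 GB/s per-card bandwidth and the 4.5 bits per weight below are assumptions (neither is stated in this thread), and the estimate ignores KV-cache reads, multi-GPU overhead, and compute, so real numbers land well below it.

```python
def decode_ceiling_tps(active_params: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if decoding is purely memory-bandwidth bound."""
    bytes_per_token = active_params * bits_per_weight / 8  # weights read per token
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 640.0   # assumed per-card bandwidth, not from the thread
BITS_PER_WEIGHT = 4.5    # assumed average for a ~4-bit quant

for name, active in [("27b dense", 27e9),
                     ("10b active MoE", 10e9),
                     ("3b active MoE", 3e9)]:
    tps = decode_ceiling_tps(active, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
    print(f"{name}: <= {tps:.0f} tok/s")
```

The useful part is the ratio rather than the absolute values: under these assumptions a model with 10b active parameters has a ceiling 2.7x higher than a 27b dense model, which points the same direction as the gap reported above.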

This is all on llama.cpp with the Vulkan backend. I didn't get too far in testing or using anything that requires ROCm, because there is an outstanding ROCm bug where the GPU clock stays at 100% (drawing around 60 watts) even when the model is not processing anything. The issue is now closed, but multiple commenters indicate it is still a problem. Using the Vulkan backend, my per-card idle draw is between 1 and 2 watts with the display outputs shut down and no kernel frame buffer.

by chao- 2 days ago

Talking to friends who have fought more homelab battles than I ever will, my sense is that (1) AMD has done a better job with RDNA4 than the past generations, and (2) it seems very workload-dependent whether AMD consumer gear is "good value", "more trouble", or both at the same time.

Edit: I misread the "2x r9700" as "2 rx9700", which differs from the topic of this comment (RDNA4 consumer SKUs). I'll keep my comment up, but anyone looking to get Radeon PRO cards can (should?) disregard.

by KennyBlanken 2 days ago

Given RDNA3 was a pathetic joke, it wouldn't be hard for them to do a better job.

by djsjajah 2 days ago

I have 2 of them. I would advise against it if you want to run things like vllm. I have had the cards for months and I still have not been able to create a uv env with trl and vllm. vllm works fine in docker for some models. With one GPU, gpt-oss 20b decodes at a cumulative 600-800 tps with 32 concurrent requests, depending on context length, but I was getting trash performance out of qwen3.5 and Gemma4.

If I were to do it again, I’d probably just get a dgx spark. I don’t think it’s been worth the hassle.

by girvo 2 days ago

FWIW I’m in love with my Asus GX10 and have been learning CUDA on it while playing with vllm and such. Qwen3.5 122B A10 at ~50tps is quite neat.

But do beware, it’s weird hardware and not really Blackwell. We are only just starting to squeeze full performance out of SM12.1 lately!

by cyberax 2 days ago

I have this setup, with 2x 32 GB cards. It's perfect for my needs, and cheaper than anything comparable from NV.