Just ran llama-bench at home with the similarly priced AMD AI PRO R9700 32G. The Phoronix numbers look extremely low? Maybe I'm misunderstanding their test setup. Anyway, here are some numbers. Maybe someone with access to a B70 can post a comparison.

Tried to use the same model as the article:

llama-bench -m gpt-oss-20b-Q8_0.gguf -ngl 999 -p 2048 -n 128

AMD R9700 pp2048=3867 tg128=175

And a bigger model, because testing a tiny model with a 32GB card feels like a waste:

llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128

AMD R9700 pp2048=917 tg128=22

reply
As of b8966, it is still not great.

  | model                 |      size |  params | backend | ngl |   test |            t/s |
  | --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL    | 999 | pp2048 |  851.81 ± 6.50 |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL    | 999 |  tg128 |   42.05 ± 1.99 |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  | 999 | pp2048 | 2022.28 ± 4.82 |
  | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  | 999 |  tg128 |  114.15 ± 0.23 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | SYCL    | 999 | pp2048 |  299.93 ± 0.40 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | SYCL    | 999 |  tg128 |   14.58 ± 0.06 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | Vulkan  | 999 | pp2048 |  581.99 ± 0.86 |
  | qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | Vulkan  | 999 |  tg128 |   10.64 ± 0.12 |
Edit: I've no idea why one would use gpt-oss-20b at Q8, but the result is basically the same:

  | model                 |      size |  params | backend | ngl |   test |            t/s |
  | --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | SYCL    | 999 | pp2048 |  854.16 ± 6.06 |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | SYCL    | 999 |  tg128 |   44.02 ± 0.05 |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | Vulkan  | 999 | pp2048 | 2022.24 ± 6.97 |
  | gpt-oss 20B Q8_0      | 11.27 GiB | 20.91 B | Vulkan  | 999 |  tg128 |  114.02 ± 0.13 |
Hopefully, support for the B70 will continue to improve. In retrospect, I probably should have bought an R9700 instead...
reply
"I've no idea why one would use gpt-oss-20b at Q8" - would you mind expanding on this comment?

In that particular model family, the choices are 20B and 120B, so a higher-quant 20B fits in VRAM, while with 120B you'd be settling for a lower quant. Is it that 20B MXFP4 is comparable in performance, so there's no need for Q8?

Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?

reply
The parameters in the original gpt-oss-20B model are "post-trained with MXFP4 quantization", so there just isn't much to gain by quantizing to Q8. If you look inside the Q8 model, most of the parameters are MXFP4 anyway.

Though, looking inside my "gpt-oss 20B MXFP4 MoE" model, it appears to be quantized the same way as the Q8, so that was probably an overstatement on my part.

Still, the Q8 is 12.1 GB and the FP16 is 13.8 GB. Not the ~1:2 ratio you might expect.
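
By "look inside" I mean dumping the per-tensor quantization types. A minimal sketch using the gguf-py GGUFReader (pip install gguf; the filename is just my local copy, point it at whichever GGUF you want to inspect):

  # Count per-tensor quantization types in a GGUF file (gguf-py package).
  from collections import Counter
  from gguf import GGUFReader

  reader = GGUFReader("gpt-oss-20b-Q8_0.gguf")  # placeholder path
  counts = Counter(t.tensor_type.name for t in reader.tensors)
  for qtype, n in counts.most_common():
      print(qtype, n)

That's how I saw that most of the parameters in the Q8 file are still MXFP4.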

reply
At these speeds, people end up paying more for electricity than for API calls. (California electricity prices.)
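
Rough arithmetic, with every number an assumption rather than a measurement (wall draw, a tg128-ish rate from the results above, and a guess at a California rate):

  # Back-of-envelope electricity cost per million output tokens.
  # All inputs below are assumed values, not measurements.
  watts = 350         # assumed wall draw while generating
  tok_per_s = 22      # rough tg128 rate from the results above
  usd_per_kwh = 0.40  # assumed California residential rate

  kwh_per_mtok = watts / tok_per_s * 1e6 / 3.6e6  # J per token -> kWh per 1M tokens
  print(f"{kwh_per_mtok:.1f} kWh, about ${kwh_per_mtok * usd_per_kwh:.2f} per 1M tokens")

Compare that against whatever per-token price you'd be paying a hosted API for a comparable model.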
reply
For reference, in case it's interesting to someone: a 5090 on Windows 11 with CUDA 13.1.

  | model                 |       size |   params | backend  | ngl |   test |              t/s |
  | --------------------- | ---------: |--------: | -------- | --: |------: |----------------: |
  | gpt-oss 20B MXFP4 MoE |  11.27 GiB |  20.91 B | CUDA     | 999 | pp2048 | 10179.12 ± 52.86 |
  | gpt-oss 20B MXFP4 MoE |  11.27 GiB |  20.91 B | CUDA     | 999 |  tg128 |    326.82 ± 7.82 |
  | qwen35 27B Q6_K       |  23.87 GiB |  26.90 B | CUDA     | 999 | pp2048 |   3129.92 ± 5.12 |
  | qwen35 27B Q6_K       |  23.87 GiB |  26.90 B | CUDA     | 999 |  tg128 |     53.45 ± 0.15 |
  
  build: 9d34231bb (8929)

  gpt-oss-20b-MXFP4.gguf
  Qwen3.6-27B-UD-Q6_K_XL.gguf
Using the MXFP4 version of GPT-OSS because it was trained quantization-aware for that format, and it's native to the 50xx series.
reply
You can get 120 TPS (144 peak) with Qwen3.6-27B on an RTX PRO 6000 with autoround when MTP is enabled. It runs faster than Sonnet API calls.

A 5090 gets maybe 100 TPS with MTP.

reply
The build they use is from February, over two months old: https://github.com/ggml-org/llama.cpp/releases/tag/b8121

That might not sound like much, but two months is a long time in LLM terms, especially for support of new hardware like the R9700.

reply
Also from Phoronix, a comparison with the AMD R9700 and RTX 6000 Ada (because Nvidia has not sent them a Blackwell card): https://www.phoronix.com/review/intel-arc-pro-b70/2
reply