| model | size | params | backend | ngl | test | t/s |
| --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL | 999 | pp2048 | 851.81 ± 6.50 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL | 999 | tg128 | 42.05 ± 1.99 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | pp2048 | 2022.28 ± 4.82 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 114.15 ± 0.23 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | SYCL | 999 | pp2048 | 299.93 ± 0.40 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | SYCL | 999 | tg128 | 14.58 ± 0.06 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | Vulkan | 999 | pp2048 | 581.99 ± 0.86 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | Vulkan | 999 | tg128 | 10.64 ± 0.12 |
Edit: I've no idea why one would use gpt-oss-20b at Q8, but the result is basically the same:

| model | size | params | backend | ngl | test | t/s |
| --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | SYCL | 999 | pp2048 | 854.16 ± 6.06 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | SYCL | 999 | tg128 | 44.02 ± 0.05 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 999 | pp2048 | 2022.24 ± 6.97 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 114.02 ± 0.13 |
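For anyone wanting to reproduce rows like these, the pp2048/tg128/ngl columns map directly onto llama-bench flags. A minimal sketch, assuming separate SYCL and Vulkan builds of llama.cpp (the binary paths below are my guesses, adjust to your setup):

```python
# Minimal sketch: run llama-bench from two differently-built llama.cpp trees
# (one configured with -DGGML_SYCL=ON, one with -DGGML_VULKAN=ON).
# Binary paths are assumptions; point them at your own builds and models.
import subprocess

MODELS = ["gpt-oss-20b-MXFP4.gguf", "Qwen3.6-27B-UD-Q6_K_XL.gguf"]
BACKENDS = {
    "SYCL": "./build-sycl/bin/llama-bench",
    "Vulkan": "./build-vulkan/bin/llama-bench",
}

for backend, binary in BACKENDS.items():
    for model in MODELS:
        # -p 2048 produces the pp2048 row, -n 128 the tg128 row,
        # -ngl 999 offloads all layers to the GPU.
        print(f"### {backend}: {model}")
        subprocess.run([binary, "-m", model, "-p", "2048", "-n", "128", "-ngl", "999"], check=True)
```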
Hopefully, support for the B70 will continue to improve. In retrospect, I probably should have bought an R9700 instead...

In that particular model family, the choices are 20B and 120B, so the 20B at a higher quant fits in VRAM, while with the 120B you'd be settling for a lower quant. Is it that 20B MXFP4 is comparable in performance, so there's no need for Q8?
Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?
Though, looking inside my "gpt-oss 20B MXFP4 MoE" model, it appears to be quantized the same way as the Q8, so that was probably an overstatement on my part.
Still, the Q8 is 12.1 GB and the FP16 is 13.8 GB. Not the ~1:2 ratio you might expect.
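That near-1:1 ratio actually makes sense if the MoE expert tensors stay MXFP4 in both files (which the identical 11.27 GiB in the tables suggests) and only the non-expert tensors change precision. Rough back-of-the-envelope math, with the expert/non-expert parameter split being my assumption rather than something read out of the GGUF:

```python
# Why a Q8_0 and an F16 export of gpt-oss-20b end up so close in size.
# Assumption (not read from the GGUF): roughly 19.1B of the 20.9B params sit in
# MoE expert tensors that stay MXFP4 in both files; only the remaining ~1.8B
# non-expert params actually change precision between Q8_0 and F16.
GB = 1e9
expert_params, other_params = 19.1e9, 1.8e9
mxfp4_bpw = 4 + 8 / 32            # 4-bit values + one 8-bit scale per 32 weights = 4.25 bpw

def file_size_gb(other_bpw):
    bits = expert_params * mxfp4_bpw + other_params * other_bpw
    return bits / 8 / GB

print(f"Q8_0-ish: {file_size_gb(8.5):.1f} GB")   # Q8_0 stores ~8.5 bits/weight with its block scales
print(f"F16:      {file_size_gb(16):.1f} GB")    # roughly the 12.1 GB vs 13.8 GB observed above
```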
| model | size | params | backend | ngl | test | t/s |
| --------------------- | ---------: |--------: | -------- | --: |------: |----------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 999 | pp2048 | 10179.12 ± 52.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 999 | tg128 | 326.82 ± 7.82 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | CUDA | 999 | pp2048 | 3129.92 ± 5.12 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | CUDA | 999 | tg128 | 53.45 ± 0.15 |
build: 9d34231bb (8929)
gpt-oss-20b-MXFP4.gguf
Qwen3.6-27B-UD-Q6_K_XL.gguf
Using the MXFP4 of GPT-OSS because it was trained quantization-aware for this quantization type, and it's native to the 50xx series.

A 5090 gets maybe 100 TPS with MTP.
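For anyone wondering what MXFP4 actually is: it's the OCP microscaling FP4 format, where each block of 32 weights is stored as 4-bit E2M1 values sharing one 8-bit power-of-two scale, which is the layout Blackwell (50xx) hardware can consume directly. A rough decode sketch based on my reading of the spec, not llama.cpp's actual kernels:

```python
# Rough sketch of MXFP4 (OCP microscaling FP4) decoding, per my reading of the spec:
# 32 weights per block, each a 4-bit E2M1 value, plus one shared E8M0 scale
# (an 8-bit power-of-two exponent). Not llama.cpp's actual kernel code.

# The 16 representable E2M1 values: 0, 0.5, 1, 1.5, 2, 3, 4, 6 and their negatives.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
        -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def decode_block(nibbles, scale_byte):
    """nibbles: 32 ints in 0..15; scale_byte: shared exponent, biased by 127."""
    scale = 2.0 ** (scale_byte - 127)
    return [E2M1[n] * scale for n in nibbles]

# 4 bits per weight plus 8 bits of scale per 32 weights = 4.25 bits/weight on average.
print(4 + 8 / 32)
```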
Which might not sound like much, but 2 months in LLM time is a long time, especially regarding support for new hardware like the R9700.