undefined

points

by christkv9 hours ago |

comments

by Havoc9 hours ago|

[-]

In general you’re mem bandwidth constrained so cpu vs gpu often ends up similar on APUs

by fulafel8 hours ago|

parent|

[-]

There are ways to trade off compute power for memory bandwidth (like MTP and other speculative decoding approaches). The CPU and GPU would need to be able to share the same cache for this to work. In the Strix Halo case the GPU has a private cache on the GPU die I think, which is the snag.

by cafkafk9 hours ago|

prev|

[-]

If you get the inference engine to route the heavy matrix math to the GPU and the speculative drafting to the CPU without choking on latency it's probably gonna be very fast.

Would love to see the benchmarks if someone actually pulls something like that off.