I am just saying it's not as flexible/cost-free as it would be on a 'normal' von Neumann-style CPU.
I would love to see Rust-based code that obviates the need to write CUDA kernels (including compiling to different architectures). But it feels icky to introduce things like async/await into a GPU programming model, which is very different from the traditional Rust programming model.
You still have to worry about different architectures and the streaming nature at the end of the day.
I am very interested in this topic, so I am curious to learn how the latest GPUs help manage this divergence problem.
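To make concrete what I mean by the divergence cost, here is a toy model in Python. It is a rough sketch, not a description of any real GPU: the function names and cycle counts are made up for illustration, and it ignores predication, reconvergence details, and whatever newer hardware does to soften this. The only thing it encodes is the basic SIMT idea that lanes in a warp run in lockstep, so a warp whose lanes disagree on a branch walks through both sides.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA hardware


def warp_cycles(takes_if, if_cost, else_cost):
    """Toy cost of one warp going through an if/else in lockstep.

    takes_if holds one boolean per lane. The cycle costs are hypothetical.
    """
    cycles = 0
    if any(takes_if):
        cycles += if_cost    # lanes on the else side sit masked off
    if not all(takes_if):
        cycles += else_cost  # lanes on the if side sit masked off
    return cycles


def independent_cycles(takes_if, if_cost, else_cost):
    """Same branch with one independent core per thread: each thread
    pays only for its own side, so wall time is the slowest thread."""
    return max(if_cost if t else else_cost for t in takes_if)


uniform = [True] * WARP_SIZE
diverged = [lane % 2 == 0 for lane in range(WARP_SIZE)]

print(warp_cycles(uniform, 100, 100))          # every lane agrees
print(warp_cycles(diverged, 100, 100))         # lanes split across both sides
print(independent_cycles(diverged, 100, 100))  # CPU-style threads
```

In this model the diverged warp pays for both sides of the branch while the independent threads pay for one. That gap is the part I don't see an async/await-style abstraction hiding.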