God, as someone who took a graphics programming elective back when GPGPU and compute shaders first became a thing, reading this makes me realize I definitely need an update on what modern GPU uarchs look like now.

Re: heterogeneous workloads: I'm told by a friend in HPC that the old advice about avoiding divergent branches within warps is no longer much of an issue – is that true?

reply
That advice still applies within a warp, where the 'threads' are effectively SIMD lanes that pay a serialization cost when they disagree on a branch. The article is consistently about running heterogeneous tasks on different warps, which is a different thing entirely – see the sketch below.
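A minimal CUDA sketch of the distinction (kernel name and shapes are mine): the first branch splits on the lane index, so lanes within one warp diverge and the hardware executes both sides serially with masking; the second splits on the warp index, so every lane in a warp agrees and different warps run different code at no divergence cost.

```cuda
__global__ void divergence_demo(float* out) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;  // position within the warp
    int warp = threadIdx.x / 32;  // which warp within the block

    // Divergent: lanes of the same warp take different sides,
    // so the warp serializes both paths.
    if (lane % 2 == 0)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;

    // Uniform per warp: all 32 lanes agree on the branch,
    // so heterogeneous work across warps costs nothing extra.
    if (warp % 2 == 0)
        out[tid] *= 10.0f;
    else
        out[tid] *= 20.0f;
}
```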
reply
Yes, that's the idea.

GPU-wide memory is not quite as scarce on datacenter cards or systems with unified memory. One could also have local executors with local futures that are `!Send` and placed in a faster address space.
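Setting aside the Rust `!Send` machinery, the "faster address space" on a GPU would be per-block `__shared__` memory, which is inherently block-local – no other block can touch it, which is the same guarantee `!Send` encodes. A minimal sketch of staging state there (names are mine):

```cuda
__global__ void block_local_state(const float* in, float* out, int n) {
    // Block-local scratch in shared memory: much faster than global
    // memory, and invisible to every other block.
    __shared__ float scratch[256];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) scratch[threadIdx.x] = in[tid];
    __syncthreads();  // make the staged data visible block-wide

    // Work against the fast local copy instead of global memory.
    if (tid < n) out[tid] = scratch[threadIdx.x] * 2.0f;
}
```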

reply
This is already happening in C++: NVIDIA is the one pushing the senders/receivers proposal (P2300, std::execution), which is one of the candidate coroutine runtimes for the C++ standard library.
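For anyone who hasn't seen the style, a minimal sketch using NVIDIA's stdexec reference implementation of P2300 (the thread pool here is CPU-side; the same repo's nvexec layer provides CUDA-stream schedulers with the same sender model):

```cpp
#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>
#include <cstdio>

int main() {
    exec::static_thread_pool pool(4);
    auto sched = pool.get_scheduler();

    // A sender merely describes work; nothing runs until it is awaited.
    auto work = stdexec::schedule(sched)
              | stdexec::then([] { return 40; })
              | stdexec::then([](int v) { return v + 2; });

    // sync_wait drives the chain to completion on the pool.
    auto [result] = stdexec::sync_wait(std::move(work)).value();
    std::printf("result = %d\n", result);
}
```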
reply
A ton of GPU workloads require leaving large amounts of data resident in GPU memory and re-running computation as small batches of new data arrive from the CPU.
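A minimal sketch of that pattern (kernel name and sizes are mine): allocate the big buffer once, keep it resident for the whole run, and per step copy only the small delta across the bus before relaunching.

```cuda
#include <cuda_runtime.h>

__global__ void fold_delta(float* resident, const float* delta, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) resident[i] += delta[i % d];  // fold new data into resident state
}

int main() {
    const int N = 1 << 24, D = 1024;
    float *d_resident, *d_delta;
    float h_delta[D] = {};  // fresh data produced on the CPU each step

    cudaMalloc(&d_resident, N * sizeof(float));  // stays on the GPU for the whole run
    cudaMalloc(&d_delta, D * sizeof(float));
    cudaMemset(d_resident, 0, N * sizeof(float));

    for (int step = 0; step < 100; ++step) {
        // Only the small per-step delta crosses the PCIe/NVLink bus.
        cudaMemcpy(d_delta, h_delta, D * sizeof(float), cudaMemcpyHostToDevice);
        fold_delta<<<(N + 255) / 256, 256>>>(d_resident, d_delta, N, D);
    }
    cudaDeviceSynchronize();

    cudaFree(d_resident);
    cudaFree(d_delta);
}
```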
reply