ProRes decodes faster than realtime single-threaded on a decade-old CPU too
It doesn't make sense. It's very different from, say, a video game, where a texture is loaded into VRAM once and then yes, all the work really is done on the GPU. A video has CPU IO on every frame, so you're still doing a ton of CPU work. I don't know why people are talking about power efficiency: in a pro editing context your CPU will be very, very busy with these IO threads, including and especially in ffmpeg with hardware encoding/decoding, no less. It looks nothing like the video-game workload this stack was designed for.
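You can see that per-frame CPU traffic directly in a "hardware" pipeline. A sketch, assuming a box with ffmpeg built with CUDA support (the input filename is a placeholder):

```shell
# Decode on the GPU (NVDEC), but pull every decoded frame back to
# system RAM via hwdownload -- the per-frame CPU copy/IO described above.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
       -vf "hwdownload,format=nv12" -f null -
```

Even with `-f null -` (no encode, no output file), the CPU is busy demuxing the container, feeding compressed packets to the decoder, and receiving each uncompressed frame back over PCIe.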
This is valid for a single stream, but the equation changes when you're trying to squeeze the highest number of simultaneous streams out of the least CapEx possible. Sure, you still have to transfer the data back to system memory just before you send it over WebRTC/HTTP/whatever, but you unlock a lot of capacity by using all the rest of the silicon as much as you can. Being able to use a budget/midrange GPU instead of a high-end, ultra-high-core-count CPU could make a big difference to a business with the right use case.
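Back-of-the-envelope, with made-up numbers purely to illustrate the CapEx argument (real per-stream throughput depends entirely on codec, resolution, preset, and hardware):

```shell
# All figures are hypothetical assumptions for illustration,
# not benchmarks of any real CPU or GPU.
cores=32
streams_per_core=2            # assumed x264 1080p30 encodes per core
nvenc_streams=30              # assumed sessions a midrange NVENC block sustains

cpu_only=$(( cores * streams_per_core ))
combined=$(( cpu_only + nvenc_streams ))
echo "CPU-only box:  ${cpu_only} streams"
echo "CPU + one GPU: ${combined} streams for one midrange card more"
```

The point is the ratio, not the absolute numbers: if one midrange card adds a meaningful fraction of a whole server's capacity, the per-stream cost drops accordingly.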
That said, TFA doesn't seem to be targeting that kind of high stream density use-case either. I don't think e.g. Frigate NVR users are going to switch to any of the mentioned technologies in this blog post.
I haven't actually looked into this, but it might not be outside the realm of possibility. You're already generating the frame on the GPU; if you can also encode it there (whether with NVENC or Vulkan doesn't matter), you could then DMA the encoded bitstream to the NIC, using the CPU only to process the packet headers, assuming that can't also be handled by the GPU/NIC.
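The encode half of that pipeline already works today in ffmpeg: keep decode and encode both on the GPU so uncompressed frames never round-trip through system RAM, and the CPU only shuffles compressed packets. A sketch, again assuming CUDA-enabled ffmpeg and placeholder filenames (the direct GPU-to-NIC DMA step would be extra plumbing, e.g. GPUDirect-style, and isn't shown):

```shell
# NVDEC decode -> NVENC encode, frames staying in VRAM the whole way.
# The CPU handles demux/mux and compressed packets, not raw frames.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
       -c:v h264_nvenc -b:v 6M output.mp4
```

Compressed packets are orders of magnitude smaller than raw frames, so even without the NIC trick this already removes most of the per-frame PCIe and memory traffic.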