One of the choke points of all modern video codecs that target high compression ratios is the arithmetic entropy coding: CABAC for H.264 and H.265, 16-symbol arithmetic coding for AV1. There is no way to parallelize that, AFAIK: decoding the next symbol depends on the state left by the previous one. All you can do is a bit of speculative decoding, but that doesn't go very deep. Even when implemented in hardware, arithmetic decoding is hard to parallelize.
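The serial dependency can be seen in a toy adaptive binary arithmetic coder (exact arithmetic via `Fraction`; a sketch for illustration only, nothing like a production CABAC implementation):

```python
from fractions import Fraction

def arithmetic_encode(bits):
    """Encode a bit sequence with an adaptive order-0 model.
    Returns a Fraction inside the final interval."""
    low, width = Fraction(0), Fraction(1)
    c0, c1 = 1, 1  # adaptive symbol counts (Laplace smoothing)
    for b in bits:
        p0 = Fraction(c0, c0 + c1)
        if b == 0:
            width *= p0
            c0 += 1
        else:
            low += width * p0
            width *= 1 - p0
            c1 += 1
    return low + width / 2  # any point in [low, low + width) works

def arithmetic_decode(x, n):
    """Decode n bits. Note the loop-carried dependency: the interval
    and the model state after symbol i are required to decode symbol
    i+1, so this loop cannot be parallelized."""
    bits = []
    low, width = Fraction(0), Fraction(1)
    c0, c1 = 1, 1
    for _ in range(n):
        p0 = Fraction(c0, c0 + c1)
        split = low + width * p0
        if x < split:
            bits.append(0)
            width *= p0
            c0 += 1
        else:
            bits.append(1)
            low = split
            width *= 1 - p0
            c1 += 1
    return bits
```

Because both the interval bounds and the adaptive probabilities mutate on every symbol, there is no point in the stream where a second decoder thread could safely start.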

This is especially a choke point when you use these codecs for high quality settings. The prediction and filtering steps later in the decoding pipeline are relatively easy to parallelize.

High-throughput codecs like ProRes don't use arithmetic coding but a much simpler, table-based coding scheme.

reply
FFv1's range coder has higher complexity than CABAC. The issue is serialization: mainstream codecs require that each block depends on previously decoded blocks. Tiles exist, but they're so much larger, and so rarely used, that they may as well not exist.
reply
> the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.

It depends on what you're going for. If you're trying to do the absolute highest fidelity for archiving a Blu-ray disc, AMD Epyc reigns supreme. That's because you need a lot of flexibility to really dial in the quality settings. Pirates over at PassThePopcorn obsess over minute differences in quality that I absolutely cannot notice with my eyes, and I'm glad they do! Their encodes look gorgeous. This quality can't be achieved with the silicon of hardware-accelerated encoders, and due to driver limitations (not silicon limitations) it also can't be achieved by CUDA cores / execution engines / etc. on GPUs.

But if you're okay with a small amount of quality loss, the optimal move for the highest number of simultaneous encodes, or the fastest FPS, is to skip CPU and GPU "general compute" entirely. Hardware-accelerated encoding via QSV/VAAPI can get you 8-30 simultaneous 1080p encodes on a very cheap Intel iGPU. This means using special sections of silicon whose sole purpose is H.264/H.265/etc. encoding, plus cropping, scaling, and color adjustments. The "hardware accelerators" I'm talking about are generally present in the CPU/iGPU/GPU/SoC, but they are not general purpose: they can't be used for CUDA/ROCm/etc. Either they're being used for your video pipeline specifically, or they're not being used at all.
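As a rough sketch of what such a pipeline invocation looks like, here is a hypothetical helper that builds an ffmpeg command keeping decode, scale, and encode on the iGPU's video engine via VAAPI (the render node path and QP value are typical defaults, not universal; check your own device):

```python
def vaapi_h264_cmd(src, dst, device="/dev/dri/renderD128", qp=24):
    """Build an ffmpeg command line that runs decode, scaling, and
    H.264 encode entirely on the VAAPI video engine, leaving the CPU
    and the GPU's 3D/compute units idle. Illustrative sketch only."""
    return [
        "ffmpeg",
        "-hwaccel", "vaapi",
        "-hwaccel_device", device,
        "-hwaccel_output_format", "vaapi",   # keep frames in GPU memory
        "-i", src,
        "-vf", "scale_vaapi=w=1920:h=1080",  # scale on the video engine too
        "-c:v", "h264_vaapi",
        "-qp", str(qp),
        dst,
    ]
```

You would run it with something like `subprocess.run(vaapi_h264_cmd("in.mp4", "out.mp4"), check=True)`; the key point is that no filter in the chain ever pulls frames back into system memory.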

I'm doing this now for my startup: we've tuned it to use 0% of the CPU and 0% of the iGPU's Render/3D engine (the most "general purpose" section of the GPU, left completely free for ML models), utilizing only the Video Engine and Video Enhance engines.

For something like Frigate NVR, that's perfect. You can support a large # of cameras on cheap hardware and your encoding/streaming tasks don't load any silicon used for YOLO, other than adding to overall thermal limits.

Video encoding is a very deep topic. You need to have benchmarks, you need to understand not just "CPU vs GPU" ... but down to which parts of the GPU you're using. There's an incredible amount of optimization you can do for your specific task if you take the time to truly understand the systems level of your video pipeline.

reply
> But if you're okay with a small amount of quality loss,

I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.

> If you're trying to do the absolute highest fidelity for archiving a blu-ray disk, AMD Epyc reigns supreme.

You don't need any special CPU to get the highest fidelity as long as you're willing to wait. For archiving purposes any CPU will do, just be prepared to let it run for a long time.

reply
> You don't need any special CPU to get the highest fidelity as long as you're willing to wait.

Correct, but Epyc "reigns supreme" for anyone caring about performance / total FPS throughput, which is relevant for anyone who cares about TFA at all - the purpose of using GPU is to "go faster", and that's what Epyc offers for use cases that also care about extreme fidelity.

> I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.

Sure. It absolutely depends on your use case. We're using it for RDP/KVM-type video, so for us the quality loss is indeed quite "small". Our users care more about "can I read the text clearly?" and less about color-banding. The hardware accelerators do a great job with text clarity so for our use-case it's not much of a noticeable quality loss. I will admit the colors are very noticeably distorted, but the shapes are correct and the contrast/sharpness is good.

Using 0% of the CPU and GPU for encoding is a HUGE win that's totally worth it for us - hardware costs stay super low. Using really old, bottom-of-the-barrel CPUs for 30+ simultaneous encodes feels like cheating. Hardware-accelerated encoding also provides another massive win by tangibly reducing latency for our users vs. CPU/GPU encoding (it's not just throughput that improves; each live frame gets through the pipeline faster too).

I wouldn't use COTS hardware accelerators for archiving Blu-ray videos. Hell, I'm not even aware of any COTS hardware accelerators that support HDR; they probably exist, but I've never stumbled across one. But hardware-accelerated encoding really is ideal for a lot of other stuff, especially when you care about CapEx at scale. If you're at the scale of Netflix or YouTube, you can get custom silicon made that provides ASIC acceleration at any quality you like. That said, they seem to choose to degrade video quality to save money, to the point that 10-20% of their users hate the quality (myself included; quality is one of the primary reasons I use PassThePopcorn instead of the legal streaming services). But that's a business choice, not a technical limitation of ASIC acceleration. Of course, that's only if you have the scale to pay for custom silicon; COTS solutions absolutely DO have a noticeable quality loss, as you argue.

reply
> We're using it for RDP/KVM-type video, so for us the quality loss is indeed quite "small". Our users care more about "can I read the text clearly?" and less about color-banding. The hardware accelerators do a great job with text clarity so for our use-case it's not much of a noticeable quality loss.

This is a perfect use case for hardware video acceleration.

The hardware encoder blocks are great for anything live-streaming related. The video they produce uses a much higher bitrate and has lower quality than what you could get with a CPU encoder, but if doing a lot of real-time encodes is what matters, they deliver.

reply
Common video codecs are often hardware accelerated, and quite often that acceleration lives on the CPU side, since a lot of systems without dedicated GPUs still play video, like notebooks and smartphones. So in the end it's less about whether the work is parallelizable and more about whether a GPU implementation can beat dedicated hardware, and the answer to that should almost always be no.

P.S.: In video decoding, speed is only relevant up to a certain point, that being: "Can I decode the next frame(s) in time to show it/them without stuttering?" Once that has been achieved, other factors such as power draw become more important.

reply
It is my understanding that hardware-accelerated video encoders (as in the fixed-function ones built into consumer GPUs) produce lower quality output than software-based encoders. They're really only there for on-the-fly encoding, like streaming to Twitch or recording security camera footage. But if you're encoding your precious family memories or backing up your DVD collection, you want a software encoder. Therefore, if a hypothetical software H.264 encoder could be made faster on the GPU, it would have value for anyone doing not-on-the-fly encoding where they care about quality.
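For concreteness, here's a hypothetical helper building the kind of quality-first software encode people mean here: libx264 with a slow preset and a low CRF (CRF 18 and "veryslow" are common archival-leaning choices, not any official standard):

```python
def x264_archive_cmd(src, dst, crf=18, preset="veryslow"):
    """Build an ffmpeg command for a fidelity-oriented software encode:
    libx264, slow preset (more thorough search = better quality per bit),
    constant rate factor instead of a fixed bitrate. Sketch only; the
    specific crf/preset values are illustrative assumptions."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-preset", preset,    # speed/quality trade-off knob
        "-crf", str(crf),     # lower CRF = higher quality, bigger file
        "-c:a", "copy",       # leave audio untouched
        dst,
    ]
```

The whole point of the software path is access to knobs like these; fixed-function encoders expose far fewer of them.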

One source for the software encoder quality claim is the "transcoding" section of this article: https://chipsandcheese.com/i/138977355/transcoding

reply
> ... That being: "Can I decode the next frame(s) in time to show it/them without stuttering".

Except when you are editing video, or rendering output. When you have multiple streams of very high definition input, you definitely need much more than realtime speed decoding of a single video.

And you would want to scrub around the video(s), jumping to any timecode, and get the target frame preferably showing as soon as your monitor refreshes.

reply
I think it's mostly because most CPUs that can drive a GPU already have dedicated H.264 encoder blocks, which are way more efficient both energy-wise and speed-wise.
reply
This is literally what the article is about. It answers your questions.
reply
A GPU's job is to take inputs at some resolution, transform them, and then output them at that resolution. H.264/H.265 (and really, any playback format) needs a fundamentally different workflow: it takes as many frames as your framerate dictates, stores the first frame as a full frame, and then stores N-1 diffs describing only which pixels changed between successive frames. That's something GPUs are terrible at. You could certainly use the GPU to calculate the full frame diff, but then you still need to send it back to the CPU or to dedicated encoding hardware to turn it into an actual concise diff description. At that point, you might as well make the CPU or hardware encoder do the whole job; you're not saving any appreciable time by sending the data over to the GPU first, only to get it back in a form where you still have to go over every pixel afterwards.
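The keyframe-plus-diffs idea described above can be sketched in a few lines (a deliberate toy: real codecs use motion-compensated prediction and transforms, not a raw pixel diff, but the frame-to-frame dependency is the same):

```python
def frame_delta(prev, curr):
    """Toy inter-frame 'diff' over two frames given as flat lists of
    pixel values: record only (index, new_value) for changed pixels."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]

def apply_delta(prev, delta):
    """Reconstruct the next frame from the previous frame plus its diff.
    This is why frame N can't be rebuilt before frame N-1 exists."""
    out = list(prev)
    for i, v in delta:
        out[i] = v
    return out
```

Even in this toy form, producing the compact diff means touching every pixel again after the comparison, which is the round trip the comment argues makes a GPU detour pointless.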
reply