Basically, it doesn't need dedicated hardware acceleration because it can use generic vector instructions to reach similar speeds. I wonder how true that is, though.
No one is suggesting the negotiation mess that exists in most standards. A single binary switch that turns on the hardware-accelerated option only when it's available on both ends would have been a good decision.
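
To make that concrete, here is a rough sketch of what a one-bit switch could look like, as opposed to a full cipher-suite negotiation. Everything in it is an assumption for illustration: the cipher labels and the `choose_cipher` function are made up, and no real protocol is being described. Each side advertises a single flag, and the accelerated option is used only when both flags are set.

```python
# Hypothetical labels; no specific algorithms are implied.
DEFAULT_CIPHER = "software-cipher"   # runs well on generic vector instructions
ACCELERATED_CIPHER = "hw-cipher"     # assumed to need dedicated hardware support


def choose_cipher(local_hw: bool, peer_hw: bool) -> str:
    """Use the accelerated cipher only if both ends advertise hardware support."""
    return ACCELERATED_CIPHER if (local_hw and peer_hw) else DEFAULT_CIPHER


if __name__ == "__main__":
    # Walk through all four combinations of the two one-bit flags.
    for local_hw in (False, True):
        for peer_hw in (False, True):
            print(local_hw, peer_hw, "->", choose_cipher(local_hw, peer_hw))
```

The point is that there are only two possible outcomes and one fallback, so there is no list of suites to order, match, or downgrade.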