In a clockless cpu design you'd indeed expect xor to be faster. But in a regular CPU with a clock you either waste a bit of xor performance by making xor and sub both take the same number of ticks, or you speed up the clock enough that the speed difference between xor and sub justifies sub being at least a full tick slower
The former just seems way more practical