Typically, for 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA once you account for the SIMD nature of the inputs/outputs.
actv = A[_:1] & B[_:1]   # both operands nonzero
sign = A[_:0] ^ B[_:0]   # sign of the product
dot = pop_count(actv & ~sign) - pop_count(actv & sign)
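A minimal sketch of the scheme above in Python, assuming the `A[_:0]`/`A[_:1]` notation means two bit-planes per operand: plane 0 holds the sign bit (1 = negative) and plane 1 holds the "nonzero" activation bit, so the values are effectively ternary (-1/0/+1). The `pack` helper and variable names are mine, not from the original:

```python
def popcount(x):
    """Count set bits; stands in for a hardware pop_count instruction."""
    return bin(x).count("1")

def pack(vec):
    """Pack a list of ternary values into (sign_plane, actv_plane) integers."""
    sign = actv = 0
    for i, v in enumerate(vec):
        if v != 0:
            actv |= 1 << i       # activation bit: value is nonzero
            if v < 0:
                sign |= 1 << i   # sign bit: value is negative
    return sign, actv

def ternary_dot(a, b):
    a_sign, a_actv = pack(a)
    b_sign, b_actv = pack(b)
    actv = a_actv & b_actv       # positions where both operands are nonzero
    sign = a_sign ^ b_sign       # sign of each product (1 = negative)
    # positives minus negatives, restricted to the active positions
    return popcount(actv & ~sign) - popcount(actv & sign)

a = [1, -1, 0, 1, -1]
b = [1, 1, -1, 0, -1]
assert ternary_dot(a, b) == sum(x * y for x, y in zip(a, b))
```

In the SIMD version, `pack` happens once at load time and the three bitwise ops plus two popcounts cover an entire vector register's worth of elements per instruction.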
It can probably be made more efficient by storing the weights in a column-first format.
Since we are in CPU land, we mostly deal with dot products sized to fit in cache; I don't assume we have a tiled matmul instruction, and one would be unlikely to support this weird 1-bit format anyway.
So it's not that the individual ops are faster; it's that the packed representation lets each instruction do more useful work, and you move far less data from memory to do it.