upvote
Haven't looked closely, but on modern x86 CPUs it might be possible to do much better with the gf2affineqb instructions, which let us do 8x8 bit matrix multiplications efficiently. Not sure how you'd handle the 2-bit part, of course.
reply