I agree it is not much additional evidence! If someone wanted to try running the same test on a series of N commits from that list including this one I'd be very curious to see the answer!
Realistically, if you are scanning each kernel commit to check if they might be patching a security issue, you are going to be asking an LLM "is this security related, if so vaguely how" with low effort and taking "maybe" as a yes before feeding it to a more expensive model. You aren't trying to establish a probability of an ultimately unknowable fact, there is ground truth that you can find by producing an exploit, so you are just trying to pre-filter before spending the money to find it.
Yeah, ideally we would need the phi coefficient (aka MCC, the binary Pearson correlation), which can be calculated from a confusion matrix of yes/no LLM classifications for all kernel diffs. (Number of true positives, true negatives, false positives, false negatives.)