I kind of hate the idea, but you probably could do a lazy LLM check of every paper and every citation and have it flag possibly wrong (in the second sense) citations for human review.
But you'd need a LOT of tokens and a LOT of human-hours.
And then what, are we done? How have we avoided the need for the same exhaustive human review? It only saves human-review time if you trust the LLM not to miss things.
An LLM could replace the random sampling. It doesn't need to be particularly good for the approach to provide value. I would worry about LLM bias, though.
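A minimal sketch of that triage idea: score every citation for suspiciousness and send only the top few to human review, instead of sampling uniformly. Everything here is hypothetical; `llm_flags_citation` is a toy stub (a word-overlap heuristic) standing in for an actual model call, and the field names are made up for illustration.

```python
def llm_flags_citation(citation: dict) -> float:
    """Stub suspicion score in [0, 1]; a real pipeline would call a model here.

    Toy heuristic: how little the claimed title overlaps the cited paper's
    actual title.
    """
    claimed = set(citation["claimed_title"].lower().split())
    actual = set(citation["actual_title"].lower().split())
    overlap = len(claimed & actual) / max(len(claimed), 1)
    return 1.0 - overlap

def triage(citations: list[dict], budget: int) -> list[dict]:
    """Return the `budget` most suspicious citations for human review."""
    ranked = sorted(citations, key=llm_flags_citation, reverse=True)
    return ranked[:budget]

citations = [
    {"id": 1, "claimed_title": "Attention Is All You Need",
     "actual_title": "Attention Is All You Need"},
    {"id": 2, "claimed_title": "Deep Residual Learning for Robot Ethics",
     "actual_title": "Deep Residual Learning for Image Recognition"},
]

for c in triage(citations, budget=1):
    print("review citation", c["id"])
```

The point is only that the model sets the review order under a fixed human budget; if it's no better than chance, you're back to random sampling, and any systematic bias in its scores skews which authors get scrutinized.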
Another thing to consider is that readers can detect fake citations after publication and report them to arXiv, and the author can get banned.