I sic Opus, GPT5.4, and Gemini on it and have each write its own hitlist. Then I have a warden Opus instance try to disprove the findings and compose a final hitlist for me, and then a fresh-context instance goes and fixes what's on that hitlist.
They always find some little niggling thing, inconsistency, or code organization improvement. They absolutely introduce more churn into the codebase than is necessary, but the things they catch are still a net positive, and I validate each item on the final hitlist, often editing things out if they're being overeager or have found a one-in-a-million bug that's just not worth the fix. Lately, one agent keeps getting hung up on "what if the device returns invalid serial output", in which case "yeah, we crash" is a perfectly fine response.
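If it helps to see the shape of that loop, here's a rough sketch in Python. Everything in it is made up for illustration: `Finding`, `review_round`, and the stage signatures are hypothetical, and each stage is just a callable you'd wire up to whatever model or CLI you actually use. It isn't tied to any real API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    reviewer: str  # which model reported it (hypothetical field)
    summary: str   # one-line description of the issue


# Hypothetical stage signatures; plug in your own model calls.
Reviewer = Callable[[str], list[Finding]]               # code -> that model's hitlist
Warden = Callable[[str, list[Finding]], list[Finding]]  # code + all findings -> survivors
Approver = Callable[[Finding], bool]                    # my manual sign-off per item
Fixer = Callable[[str, list[Finding]], str]             # code + final hitlist -> patched code


def review_round(
    code: str,
    reviewers: list[Reviewer],
    warden: Warden,
    approve: Approver,
    fix: Fixer,
) -> tuple[str, list[Finding]]:
    """Run one review pass and return (patched_code, final_hitlist)."""
    # 1. Each reviewer writes its own hitlist, independently of the others.
    combined: list[Finding] = []
    for review in reviewers:
        combined.extend(review(code))

    # 2. The warden tries to disprove each finding and keeps what survives.
    survivors = warden(code, combined)

    # 3. I validate every surviving item and drop the overeager ones.
    final = [f for f in survivors if approve(f)]

    # 4. A fresh-context fixer sees only the code and the final hitlist.
    return fix(code, final), final
```

The one design choice worth copying is that the fixer gets nothing but the code and the approved hitlist, so it doesn't inherit the reviewers' arguments about items that were already thrown out.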