I'm inclined to use DeepSeek V4 Pro the most, because it is consistently extremely strong, very fast, and very cheap, with excellent caching and cheap-as-free cached input tokens (something like 80% of token usage is cached when I'm using it for security scanning). So my probable "pair" of frontline security researchers will be DeepSeek V4 Pro and self-hosted Gemma 4 31B (another shockingly strong contender, competitive with the best models once you let it loop on the same file a few times). But I won't be surprised if GLM 5.2 turns out better than DeepSeek V4 Pro, though it costs quite a bit more.
I've benchmarked it, and the "here's a repo, find bugs" approach finds far fewer bugs. Like, dramatically fewer. Models are good and contexts have expanded, but focus still wins on hard problems. You could probably tell the good models to make a plan to audit the repo, and they would end up making their own "loop" in the form of a checklist of files to look at over several sessions or via subagents, I assume.
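For anyone who wants to try the focused version, here is a rough sketch of a per-file loop with a few passes per file. The `review` callable is hypothetical: it stands in for whatever model call you use, and it is assumed to take the file path, the source text, and the findings so far, and to return a list of new findings. None of these names come from a real library.

```python
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical reviewer signature (an assumption, not a real API):
# (file path, source text, findings so far) -> new findings.
Reviewer = Callable[[str, str, list[str]], list[str]]


def audit_per_file(
    paths: Iterable[Path], review: Reviewer, passes: int = 3
) -> dict[str, list[str]]:
    """Review each file on its own, `passes` times, keeping unique findings."""
    findings: dict[str, list[str]] = {}
    for path in paths:
        source = path.read_text(encoding="utf-8", errors="replace")
        seen: list[str] = []
        for _ in range(passes):
            # Pass a copy of the earlier findings so the reviewer can
            # look for something new instead of repeating itself.
            for item in review(str(path), source, list(seen)):
                if item not in seen:
                    seen.append(item)
        findings[str(path)] = seen
    return findings
```

You would call it with something like `audit_per_file(Path("src").rglob("*.py"), my_review)`, where `my_review` wraps your model of choice. The point is only that each call sees one file, not the whole repo.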
Not sure if this is helpful, but in my experience, when something a bit more complex needs to be done, manually making the model read the context I know it will need (like having it consume all the project docs first) gives a more satisfactory result than only giving it the task and letting it look around and consume whatever context it thinks it needs.
Will test your bug-finding method in a current project of mine, both with and without my "manual" context preloading.
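A minimal sketch of what "manual" preloading could look like, assuming the docs are plain text files and the model just takes one prompt string. The function name, the `--- name ---` separators, the `TASK:` label, and the `max_chars` budget are all made up for illustration.

```python
from pathlib import Path


def build_preloaded_prompt(
    task: str, doc_paths: list[Path], max_chars: int = 200_000
) -> str:
    """Put the project docs in front of the task, up to a rough size budget."""
    sections: list[str] = []
    used = 0
    for path in doc_paths:
        text = path.read_text(encoding="utf-8", errors="replace")
        if used + len(text) > max_chars:
            break  # stop before overrunning the (assumed) context budget
        sections.append(f"--- {path.name} ---\n{text}")
        used += len(text)
    # Docs come first so the model has read them before it sees the task.
    return "\n\n".join(sections + [f"TASK:\n{task}"])
```

Comparing runs with and without the preloaded docs then just means sending either this string or the bare task.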