Things have changed a lot in the past six weeks.
Gemini 2.5 Pro accepts a million tokens and can "reason" with them, which means you can feed it hundreds of thousands of lines of code and it has a surprisingly good chance of figuring things out.
OpenAI released their first million-token models with the GPT 4.1 series.
OpenAI o3 and o4-mini are both very strong reasoning models for code, with 200,000 token input limits.
These models are all new within the last six weeks. They're very, very good at working with large amounts of crufty undocumented code.
Maybe in a generation or two codebases will become more uniform and predictable if fewer humans write them by hand. It's the same with self-driving cars: if there were no human drivers on the road, the problem would become trivial to solve.
They still make mistakes, and yes, they're still (mostly) next-token predicting machines under the hood, but if your mental model is "they can't actually reason through how some code will execute" you may need to update that.
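As a hypothetical illustration of the kind of execution-tracing question these models now tend to get right (this example is mine, not from any specific benchmark): Python's mutable default argument is a classic gotcha where predicting the output requires simulating state across calls, not just pattern-matching on the syntax.

```python
def append_item(item, bucket=[]):
    # The default list is created once, at function definition time,
    # so calls that omit `bucket` all share the same list object.
    bucket.append(item)
    return bucket

print(append_item(1))      # [1]
print(append_item(2))      # [1, 2] -- the shared default persists
print(append_item(3, []))  # [3] -- an explicit fresh list is unaffected
```

A model that merely matched surface patterns would plausibly guess `[2]` for the second call; getting `[1, 2]` requires tracking the hidden shared state.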