If you disagree with that, I think the onus is on you to show me that an LLM could simulate the full context in which a user interfaces with software. That's a ridiculous claim.
Feel free to show literally any evidence for this claim.
I don't think it's feasible to simulate the full depth of actual usage, given that (especially in the case of screen readers and the like) the problem has a great deal of combinatorial depth and context. Which screen readers, on which operating systems, and which users thereof?
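
To put a rough number on the combinatorial point, here's a quick sketch. The screen reader to OS pairings are the usual ones; the browser and user-setting lists are my own illustrative picks, not a survey of real usage, and I haven't filtered browsers by platform or accounted for versions at all.

```python
from itertools import product

# Common screen reader -> operating system pairings.
reader_platforms = {
    "NVDA": ["Windows"],
    "JAWS": ["Windows"],
    "Narrator": ["Windows"],
    "VoiceOver": ["macOS", "iOS"],
    "TalkBack": ["Android"],
    "Orca": ["Linux"],
}

# Illustrative assumptions, deliberately short and not filtered per platform.
browsers = ["Chrome", "Firefox", "Safari", "Edge"]
user_settings = ["defaults", "fast speech rate", "braille display", "high verbosity"]

# Every (reader, OS) pair that actually exists in the mapping above.
reader_os_pairs = [
    (reader, os_name)
    for reader, platforms in reader_platforms.items()
    for os_name in platforms
]

# Cross the pairs with browsers and user settings.
configs = [
    (reader, os_name, browser, setting)
    for (reader, os_name), browser, setting in product(
        reader_os_pairs, browsers, user_settings
    )
]

print(len(reader_os_pairs), "reader/OS pairs")
print(len(configs), "configurations before versions or user habits")
```

Even this toy version lands in the hundreds of configurations, and it says nothing about software versions or how a given person actually navigates, which is the part I'm arguing can't be simulated.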