Your key finding is that humans process the grid as one visual scene — but that's a finding about sighted cognition.
Isn't this, like most things, a sensitivity/specificity tradeoff?
How many real humans should be blocked from your system to keep the bots out?
What is the Blackstone ratio of accessibility?
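To put numbers on that question, here's a rough sketch of the bookkeeping, treating "bot" as the positive class. The function name `captcha_rates` and every count in the example call are made up purely for illustration; nothing here is measured data.

```python
def captcha_rates(bots_blocked, bots_passed, humans_blocked, humans_passed):
    """Return (sensitivity, specificity, blackstone_ratio) for a CAPTCHA.

    All inputs are hypothetical counts; "bot" is the positive class.
    """
    # Sensitivity: share of bots that the CAPTCHA actually stops.
    sensitivity = bots_blocked / (bots_blocked + bots_passed)
    # Specificity: share of real humans who get through.
    specificity = humans_passed / (humans_blocked + humans_passed)
    # Blackstone-style ratio: bots let through per human wrongly blocked.
    if humans_blocked == 0:
        blackstone_ratio = float("inf")
    else:
        blackstone_ratio = bots_passed / humans_blocked
    return sensitivity, specificity, blackstone_ratio


# Made-up example: 1000 bot attempts and 1000 human attempts.
sens, spec, ratio = captcha_rates(
    bots_blocked=900, bots_passed=100, humans_blocked=50, humans_passed=950
)
```

Tightening the CAPTCHA usually pushes sensitivity up and specificity down, so the ratio is really a policy choice about how many bots you'll tolerate to avoid locking out one person.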
I can't believe people are still using this as a generic anti-AI argument, even though a decade ago people insisted that AI could never have the capabilities frontier LLMs have today. Moreover, it's unclear whether the gap even exists. Even if we accept the claim that the grid pattern is some fundamental constraint AI models can't get past, it doesn't seem too hard to work around: infill the grid pattern and present the 9 images to the LLM as one image.
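For what it's worth, the stitching step is only a few lines. A minimal sketch, assuming each tile is a same-sized 2D list of pixel values and the tiles arrive in row-major order; `stitch_tiles` is a hypothetical helper name, and whether a model actually does better on the merged image is not tested here.

```python
def stitch_tiles(tiles, cols=3):
    """Merge equally sized tiles (row-major order) into one gap-free image.

    Each tile is a 2D list of pixel values. This is an illustrative helper,
    not part of any CAPTCHA or model API.
    """
    if not tiles or len(tiles) % cols != 0:
        raise ValueError("tile count must be a positive multiple of cols")
    tile_h = len(tiles[0])
    tile_w = len(tiles[0][0]) if tile_h else 0
    for tile in tiles:
        if len(tile) != tile_h or any(len(row) != tile_w for row in tile):
            raise ValueError("all tiles must share the same dimensions")

    merged = []
    # Walk the grid one band (row of tiles) at a time.
    for start in range(0, len(tiles), cols):
        band = tiles[start:start + cols]
        # Concatenate the y-th pixel row of every tile in the band.
        for y in range(tile_h):
            row = []
            for tile in band:
                row.extend(tile[y])
            merged.append(row)
    return merged


# Nine 2x2 tiles, each filled with its own index, become one 6x6 image.
tiles = [[[i, i], [i, i]] for i in range(9)]
merged = stitch_tiles(tiles)
```

Since the tiles are simply abutted, the separator lines disappear and the model sees a single scene rather than nine crops.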
It's not an anti-AI argument; it's an open, unsolved question. Your optimism is appreciated, but dismissing it and assuming it's already solved is foolish and naive.