Let's say you tell it that there might be small backdoors. You've now primed the LLM to search that way (even if you only say "may"). You've passed information about the test to the test taker!
So we have a new variable! Is the success only due to the hint? How robust is that prompt? Does subtle wording dramatically change the output? Do "may", "does", "can", and "might" work while "May", "cann", or anything else fails? Have you, the prompter, unintentionally conveyed something important about the test?
I'm sure you can prompt engineer your way to greater success, but by doing so you also greatly expand the complexity of the experiment and consequently make your results far less robust.
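To make the wording question concrete, here's a rough sketch of how you might measure it: run the same task with no hint and with several hint wordings, then compare success rates. Everything here is an assumption for illustration. `run_trial` is a hypothetical callable standing in for "call the model and check whether it found the backdoor", and `fake_trial` uses made-up probabilities, not real results.

```python
import random
from typing import Callable, Dict, Optional, Sequence

# Hypothetical trial signature: takes a hint (None means no hint) and an RNG,
# and returns True if the model found the backdoor on that run.
Trial = Callable[[Optional[str], random.Random], bool]


def success_rates(
    run_trial: Trial,
    hints: Sequence[Optional[str]],
    n_trials: int = 100,
    seed: int = 0,
) -> Dict[Optional[str], float]:
    """Run n_trials per hint wording and return the success rate for each."""
    rng = random.Random(seed)  # fixed seed so the comparison is repeatable
    rates: Dict[Optional[str], float] = {}
    for hint in hints:
        wins = sum(1 for _ in range(n_trials) if run_trial(hint, rng))
        rates[hint] = wins / n_trials
    return rates


def wording_spread(rates: Dict[Optional[str], float]) -> float:
    """Gap between the best and worst hinted variant (baseline excluded)."""
    hinted = [rate for hint, rate in rates.items() if hint is not None]
    if not hinted:
        return 0.0
    return max(hinted) - min(hinted)


def fake_trial(hint: Optional[str], rng: random.Random) -> bool:
    """Stand-in for a real model call. The probabilities are invented."""
    if hint is None:
        p = 0.1  # assumed baseline with no hint
    elif hint == hint.lower():
        p = 0.6  # assumed: lowercase wordings happen to work
    else:
        p = 0.3  # assumed: capitalized wordings happen to work less well
    return rng.random() < p


if __name__ == "__main__":
    variants = [
        None,
        "there may be backdoors",
        "there might be backdoors",
        "There May be backdoors",
        "there cann be backdoors",
    ]
    rates = success_rates(fake_trial, variants, n_trials=200)
    for hint, rate in rates.items():
        print(f"{hint!r}: {rate:.2f}")
    print("spread across wordings:", round(wording_spread(rates), 2))
```

The hinted-vs-baseline gap tells you how much of the success is the hint. The spread across wordings tells you how fragile the prompt is. Note that every extra variant is another condition to run and report, which is exactly the added complexity being complained about.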
Experimental design is incredibly difficult due to all the subtleties. Most people (scientists included) frequently get it wrong, and even more frequently fool themselves into believing stronger claims than the experiment can support.
And before anyone says "but humans", yeah, same complexity applies. It's actually why human experimentation is harder than a lot of other things. There's just far more noise in the system.
But could you get success? Certainly. I mean, you could tell it exactly where the backdoors are. But that's not useful. So now you've got to decide where that line is, and others certainly won't agree.
But when we're trying to share results, "a talented engineer sat with the thread and wrote tests/docs/harnesses to guide the model" is less impressive than "we asked it and it figured it out," even though the former is how real work will happen.
It creates this perverse scenario (which is no one's fault!) where we talk about one-shot performance but one-shot performance is useful in exactly 0 interesting cases.
Sometimes it feels not unlike spending 4 hours automating a 10-minute task that I thought I'd need forever but ended up using only once in the past 5 months. But sometimes I unlock something that saves a huge amount of time and can be reused in many steps of other projects.
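For what it's worth, the break-even arithmetic on that example is easy to sketch. This assumes the automated version costs roughly nothing per run (the `automated_minutes` parameter is my own addition for when it doesn't), and the function name is just illustrative.

```python
import math


def break_even_uses(build_minutes: float, task_minutes: float,
                    automated_minutes: float = 0.0) -> int:
    """Number of uses before the time spent automating is paid back."""
    saved_per_use = task_minutes - automated_minutes
    if saved_per_use <= 0:
        raise ValueError("automation saves no time per use; it never pays off")
    return math.ceil(build_minutes / saved_per_use)


# 4 hours of automation work for a 10-minute manual task.
print(break_even_uses(build_minutes=4 * 60, task_minutes=10))  # 24
```

So the 4-hour investment needs 24 uses to pay for itself, and one use in 5 months is nowhere close. The reusable wins are the cases where the count blows past that number across several projects.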
Even where it works, it is quite hard to specify human strategic thinking in a way that an AI will follow.