> Now, we are attempting to sandbox something that potentially has the agency and reasoning capabilities to try and get itself out.
The threat model for actual sandboxes has always been "an attacker now controls the execution inside the sandbox". That attacker has agency and reasoning capabilities.
I think a sandbox containing a program should only output data, and that data should conform to a schema. The old distinction between programs and data, instead of Turing-complete languages everywhere.
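A minimal sketch of what that boundary could look like, assuming a hypothetical `RESULT_SCHEMA` and JSON as the wire format (neither is prescribed above): output that doesn't conform to the schema simply never crosses the boundary, regardless of what the thing inside intended.

```python
import json

# Hypothetical schema: the only shape of data allowed out of the sandbox.
# Field name -> expected Python type after JSON parsing.
RESULT_SCHEMA = {"status": str, "exit_code": int, "stdout_bytes": int}

def validate_sandbox_output(raw: bytes) -> dict:
    """Parse sandbox output as JSON and reject anything outside the schema."""
    doc = json.loads(raw)
    if not isinstance(doc, dict):
        raise ValueError("output must be a JSON object")
    if set(doc) != set(RESULT_SCHEMA):
        raise ValueError(f"unexpected fields: {set(doc) ^ set(RESULT_SCHEMA)}")
    for field, expected in RESULT_SCHEMA.items():
        if not isinstance(doc[field], expected):
            raise ValueError(f"field {field!r} must be {expected.__name__}")
    return doc

# Conforming output passes; anything else -- extra fields, wrong types,
# smuggled code in an unexpected key -- is rejected at the boundary.
ok = validate_sandbox_output(
    b'{"status": "done", "exit_code": 0, "stdout_bytes": 512}'
)
```

The point isn't this particular checker, it's that the consumer only ever sees data matching a closed schema, never an open-ended program.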
I have been saying for years that technology increasingly requires the development of memetic firewalls - firewalls that don't just filter based on metadata, but filter based on ideas. Our firewalls need to be at least as capable as the entities they seek to keep out (or in).
That sort of firewall is going to be really expensive to run, to the point that it's a financial DoS vulnerability. What's feasible is simpler algorithms that emit alerts on a baseline pattern match, which then get routed to AI observers for mitigation once some trigger threshold is crossed. I wouldn't be surprised if someone has already deployed something like that, TBH.