It's similarly reasonable to drop a tool that's unreliable, though I don't think that's a fair description of what happened here. Instead, they used a tool that is generally known to be unpredictable and failed to sandbox it adequately.
The cold hard fact is: LLMs are an unreliable tool, and using them without checking their every action is extremely foolish.
You mean checking every action they take outside the sandbox, I suppose? Otherwise I would consider any attempt at letting an agent do some work foolish.