Compartmentalization in practice, nice. It's also very hard to defend against, because the agents that have been split up rarely realize they're working on something larger, which is why militaries and security-sensitive businesses commonly do this with their employees.
The next day, the professor caught me in the math department office (my dad worked there) and said she wanted to talk. Once we were in her office, she told me I wasn't allowed to use self-modifying code. I pushed back: "Nothing in the assignment said I couldn't, and the output is correct."
The next class, she walked in and announced that self-modifying code was no longer allowed on any assignment. Then she handed back the graded work, and I'd gotten a 100.
Thinking back on that: about a week and a half ago I asked Antigravity to build a modern GPU version of Core Wars, except with Redcode mapped directly onto the GPU instruction set. I've had some good success and it's more or less working now, though visualizing what's happening at the GPU/Redcode level is much harder.
But before Fable 5 got yanked, I asked it to "fix" the project and it refused, flipping straight to Opus 4.8. Every single request I sent triggered the fallback. I spent over an hour trying different angles, and I even turned Antigravity loose on automatic so it was the one talking to Fable 5. Same result: every exchange tripped the fallback to 4.8. I wish I'd recorded it.
I also tried a variety of direct requests in a fresh directory, such as "build simple self modifying assembler code" or just "self modifying assembler", and it would switch to 4.8 immediately. It was almost laughable.
There's ZERO credibility to any of these stories right now. If Anthropic really sent something over to this security person, and it's what she says it is, then why on earth didn't they just blog about it?
Hubris is a thing. Companies would do well to remember Steve Jobs in the early Apple days: ship early, ship often, and above all take responsibility for what you ship, even when it's broken. Code, hardware, the whole kit: all of it can be fixed. Trust is much harder to repair. Anthropic has lost mine, and while I may use them from time to time, it'll be in limited ways.
Transformers are (to grossly summarize, and I don't mean this as an insult) like auto-complete on steroids. So we get cat-and-mouse guardrails, the way swear-word filters and Chinese censorship work. People come up with increasingly complex misspellings, euphemisms and indirections to get around the filters, like saying "May 35th" instead of June 4th.
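To make the cat-and-mouse point concrete, here's a rough sketch of a keyword filter in Python. Everything in it is made up for illustration (the `BLOCKLIST` terms, the `LEET` substitution table, the function names), and it isn't how any real guardrail or censorship system is implemented. It just shows why normalizing the text catches misspellings but does nothing against a euphemism.

```python
# Hypothetical blocklist and character-swap table, for illustration only.
BLOCKLIST = {"darn", "june 4th"}
LEET = str.maketrans({"4": "a", "3": "e", "0": "o", "1": "i", "@": "a", "$": "s"})


def normalize(text):
    """Lowercase, undo common character swaps, and drop spaces/punctuation."""
    swapped = text.lower().translate(LEET)
    return "".join(ch for ch in swapped if ch.isalnum())


def naive_filter(text):
    """Plain substring match against the blocklist."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def normalized_filter(text):
    """Substring match after normalizing both the text and the blocklist terms."""
    cleaned = normalize(text)
    return any(normalize(term) in cleaned for term in BLOCKLIST)


print(naive_filter("well, darn"))                       # direct hit
print(naive_filter("well, d4rn"))                       # misspelling slips past
print(normalized_filter("well, d4rn"))                  # normalization catches it
print(normalized_filter("What happened on May 35th?"))  # euphemism still slips past
```

The filter author can keep adding rules, but each one only covers evasions that have already been seen. An indirection like "May 35th" shares no characters with the blocked phrase, so no amount of string cleanup will find it.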
I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.
Even this might not work. For example, you could ensure no bomb-related data is in the training data, but there's lots of adjacent chemistry data that, if probed the right way, would allow the LLM to synthesize the answer. Various forms of "how do I store X,Y,Z safely such that nothing bad happens" prompts probably get you most of the way there.
I can see how this is tempting, but I suspect it would yield a naive model. I think the only way to improve this is to use a model that is advanced enough to support the concept of empathy, which may allow it to recognize others as being separate from itself, similar to how toddlers develop this sense (https://blog.lovevery.com/skills-stages/empathy/).