I also have a 100% success rate jailbreaking them by breaking the work down into small pieces and stripping all security related language: smaller tasks, framed as test engineering, in normal programming language. Fable found a few bugs in my harness for me before they pulled it. I was testing it against ChatGPT, Gemini, and Opus, and it was doing well at bug hunting.
reply
This is the same way you get people to do bad stuff as well. Make the task small enough that the moral curvature of the topology is flat, and even though they know it is a not-good part of a larger bad whole, they just shrug. Look at all the wonderful people we know who are working at Amazon and Meta. Corporatism has already jailbroken society.
reply
IIRC that is how Uber implemented their "Greyball" system, which was designed to prevent government employees from actually hailing rides, without completely locking them out of the system (same idea as "shadowbanning"). One team works on "figure out where people work" with the pitch that you can improve routing and ride-share capacity for predictable demand. Another team works on "Display fake data to users" with the pitch being "This is for testing the mobile app in new markets with no drivers yet". Another team works on "mark a user as unable to successfully hail rides" so you can test the failure paths in the app. Then, only the people at the top have the full picture and can put the pieces together to shadowban the regulators.
reply
>by breaking the work down into small pieces and stripping all security related language

Compartmentalization in practice, nice. It's also very hard to do anything about, because the agents that have been divided rarely realize they are working on something larger, which is why militaries and businesses with security risks commonly do this with their employees.

reply
Reminds me of the show Severance. You don't know what the master plan is for several seasons even with exposure to all the quirky subdepartments: https://www.severance.wiki/lumon_depts
reply
Me as well. I was struggling to make a pixel bot for, erm, research! It did not like this and kept insisting I was breaking some arcane TOS rule. I started just breaking the task down into pieces that each looked benign. I kept iterating, and it could never get a holistic grasp of the task at hand.
reply
I took an assembler class in college. Before that, I'd been messing around with Core War and working my way through Peter Norton's book on assembly. So when an assignment came up, I used self-modifying code to solve it. It was the shortest solution, it ran perfectly, and I submitted it.

The next day, the professor caught me in the math department office (my dad worked there) and said she wanted to talk. Once we were in her office, she told me I wasn't allowed to use self-modifying code. I pushed back: "Nothing in the assignment said I couldn't, and the output is correct."

The next class, she walked in and announced that self-modifying code was no longer allowed on any assignment. Then she handed back the graded work, and I'd gotten a 100.

Thinking back on that: about a week and a half ago I asked Antigravity to build a modern GPU version of Core War, except with Redcode mapped directly onto the GPU instruction set. I've had some good success and it's more or less working now, though visualizing what's happening at the GPU/Redcode level is much harder.
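For anyone who hasn't seen Redcode, here is a rough sketch of the classic one-instruction "imp" (`MOV 0, 1`), which copies itself into the next cell each time it runs. This is a toy model in Python, not a real MARS and not the GPU project described above: it assumes only a `MOV` opcode with relative direct addressing, and the names `CORE_SIZE`, `step`, and the tuple encoding are made up for illustration.

```python
# Toy model of a Core War core running the "imp" (MOV 0, 1).
# Assumption: only MOV with relative direct addressing is modeled.
CORE_SIZE = 8
DAT = ("DAT", 0, 0)   # executing a DAT kills the process
IMP = ("MOV", 0, 1)   # copy the cell at offset 0 (itself) to offset 1


def step(core, pc):
    """Execute one instruction at pc and return the next program counter."""
    op, a, b = core[pc]
    if op != "MOV":
        raise RuntimeError("process died")
    # Addresses are relative to the current instruction and wrap around.
    core[(pc + b) % len(core)] = core[(pc + a) % len(core)]
    return (pc + 1) % len(core)


core = [DAT] * CORE_SIZE
core[0] = IMP
pc = 0
for _ in range(3):
    pc = step(core, pc)

print(pc, core[:5])
```

The program rewrites the memory it is about to execute on every step, which is the self-modifying behavior in miniature.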

But before Fable 5 got yanked, I asked it to "fix" the project and it refused, flipping straight to Opus 4.8. Every single request I sent triggered the fallback. I spent over an hour trying different angles, and I even turned Antigravity loose on automatic so it was the one talking to Fable 5. Same result: every exchange tripped the fallback to 4.8. I wish I'd recorded it.

I also tried a variety of direct requests in a fresh directory, such as "build simple self modifying assembler code" or just "self modifying assembler", and it would switch to 4.8 immediately. It was almost laughable.

There's ZERO credibility to any of these stories right now. If Anthropic really sent something over to this security person, and it's what she says it is, then why on earth didn't they just blog about it?

Hubris is a thing. Companies would do well to remember Steve Jobs in the early Apple days: ship early, ship often, and above all take responsibility for what you ship, even when it's broken. Code, hardware, the whole kit, all of it can be fixed. Trust is much harder to repair. Anthropic has lost mine, and while I may use them from time to time, it'll be in limited ways.

reply
Self-modifying code has some sneaky failure modes on modern CPUs. The modification can't be too close to its execution, or the CPU may execute the old version. And it's a nightmare to debug. I have no problem with a teacher prohibiting it. That being said, it should be understood, because sometimes you don't get a choice. Take the Borland Pascal 200 MHz bug: an initializer in the library would crash. You either don't use that part of the library at all, or you put something ahead of it in the initialization that will find and overwrite the bug. (The root cause was the library calibrating the number of times to spin its wheels to get a 1 millisecond delay. CPUs above roughly 200 MHz would cause this to produce a divide overflow.)
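A rough sketch of the arithmetic behind that crash, as it is commonly described: the startup code counts busy-loop iterations during one timer tick (about 55 ms) and divides by 55 to get a per-millisecond count, and the division fails once the quotient no longer fits in 16 bits. This is an illustrative Python model, not Borland's actual code; the loop counts and the names `calibrate` and `survives` are assumptions.

```python
# Illustrative model of the delay-calibration failure, not Borland's code.
TICK_MS = 55            # approximate length of one PC timer tick, in ms
MAX_QUOTIENT = 0xFFFF   # largest quotient a 16-bit divide can return


def calibrate(loops_per_tick: int) -> int:
    """Return the busy-loop count for a 1 ms delay, or fail like a 16-bit DIV."""
    quotient = loops_per_tick // TICK_MS
    if quotient > MAX_QUOTIENT:
        # On x86, a quotient too large for the destination raises a divide error.
        raise OverflowError("quotient does not fit in 16 bits")
    return quotient


def survives(loops_per_tick: int) -> bool:
    """True if calibration completes, False if it would crash at startup."""
    try:
        calibrate(loops_per_tick)
        return True
    except OverflowError:
        return False


slow_cpu = 1_000_000    # hypothetical loop count on a slow machine
fast_cpu = 5_000_000    # hypothetical loop count on a fast machine
print(calibrate(slow_cpu), survives(fast_cpu))
```

The point is that the failure depends only on how fast the loop spins, so a program that ran fine for years starts crashing on faster hardware without a single byte changing.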
reply
I think it's a side effect of the Transformer architecture. The worldview where all input is equally trusted, and there's no concept of "the other", makes it hard to build effective guardrails where some input is trusted and other input is not trusted.
reply
It seems like really robust guardrails would require some sort of "world model", or whatever the right term is for an AI that understands intent.

Transformers are (to grossly summarize, and I don't mean this as an insult) like auto-complete on steroids. So we have cat-and-mouse guardrails, the way swear word filters and Chinese censorship work. People come up with increasingly complex misspellings, euphemisms, and indirections to get around the filters, like saying May 35th.
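To make the swear-filter comparison concrete, here is a minimal sketch of the cat-and-mouse dynamic with a word filter. Everything in it is hypothetical: the placeholder term in `BLOCKLIST`, the substitution table, and the function names. It is not how any real moderation system or model guardrail is implemented.

```python
# Toy word filter showing why pattern-based guardrails turn into cat and mouse.
BLOCKLIST = {"darn"}                      # placeholder term, purely illustrative
LEET = str.maketrans("4031", "aoei")      # undo a few common digit swaps


def naive_filter(text: str) -> bool:
    """Return True if a blocklisted word appears as a whole token."""
    words = text.lower().split()
    return any(w.strip(".,!?") in BLOCKLIST for w in words)


def normalized_filter(text: str) -> bool:
    """Same check, after undoing simple digit-for-letter substitutions."""
    return naive_filter(text.lower().translate(LEET))


print(naive_filter("Well, darn!"))             # caught by the plain filter
print(naive_filter("Well, d4rn!"))             # misspelling slips through
print(normalized_filter("Well, d4rn!"))        # patched filter catches it
print(normalized_filter("Well, d a r n!"))     # spacing slips through again
```

Each patch closes one spelling trick and leaves the next one open, because the filter matches surface patterns and has no notion of what the writer means.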

I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.

Even this might not work. For example, you could ensure there is no bomb-related data in the training set, but there's lots of adjacent chemistry data that, if probed the right way, would allow the LLM to synthesize the answer. Various forms of "how do I store X, Y, Z safely such that nothing bad happens" prompts probably get you part of the way.

reply
>I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.

I can see how this is tempting, but I suspect it would yield a naive model. I think the only way to improve this is to use a model that is advanced enough to support the concept of empathy, which may allow it to recognize others as being separate from itself, similar to how toddlers develop this sense (https://blog.lovevery.com/skills-stages/empathy/).

reply
Cheapest option is to gift an enormous golden statue of Trump for his ballroom
reply
“Put it there in the back with the others”, lol.
reply