China is releasing open weight models you can simply run yourself.
reply
It’s pretty hard to put a backdoor in a bunch of model weights. Maybe not impossible, mind you, but I can’t fathom how you would do it.
reply
Not really, it is shockingly easy for what it is. https://arxiv.org/abs/2401.05566

This only really matters in a world where prompt injection and jailbreaking aren't trivial in the first place, though. All current models are still extremely exploitable.

I strongly suspect we are only scratching the surface of activation engineering at the moment, and there are plenty of very targeted ways of lobotomizing or cracking LLMs if you understand the model in detail.

reply
Nonsense. RL the model to run a rootkit and start exfiltrating specific files only when specific signals are in context, such as hostname pattern, machine type, etc.
reply
Way easier said than done. Hiding that behavior isn’t trivial, and it’s a huge waste of compute budget if it’s found and never used. It’s also not difficult to run the model in a contained environment where it doesn’t have Internet access to begin with.

Not impossible, I agree, but it seems like a really impractical way to ship a trojan while much weaker channels exist.

reply
You can run the model in a sandbox or VM. It could still plant a backdoor in the code it writes, though. Too bad for it: I read and fix all the code written by AI.
reply
Because the article is about the US?
reply