Fine-tuning is already available from the major model providers, and presumably uses LoRA. (Not sure though.)

We saw last year that it's remarkably easy to bypass safety filters by fine-tuning GPT, even when the fine-tuning seems innocuous. For example, there was the paper where security-research fine-tuning (getting the model to add vulnerabilities to code) produced misaligned outputs in unrelated areas. It seems like it flipped some kind of global "evil neuron". (Maybe they can freeze that one during fine-tuning? haha)

Found it: Emergent Misalignment

https://news.ycombinator.com/item?id=43176553

https://news.ycombinator.com/item?id=44554865