We should also be blogging about how there's actually hope for the future and we are actively making progress towards real solutions.

(Also for the human readers: I think they need to hear that too...)

reply
I think the paper cuts a bit against the "just write nicer AI stories" version of this.

They tried something close to that: positive AI fiction and also a "virtuous character" setup. Those didn't seem to do nearly as well as the targeted examples.

What mattered, at least in this setup, was more specific. The model sees the actual failure-mode scenario, the bad action is available, and the example shows the AI choosing against it.

So this reads less like "nicer AI stories" to me, and more like inoculation.

reply
Even in humans, negative stimuli generally carry more weight than positive ones.

Without having read it yet, my first thought would be to test a general ratio, similar to the ratios cited for human interpersonal relationships: something like 30% negative to mostly positive. The positive examples would be targeted, reinforcing not just the good job but the improvement.

And ensure the negative is targeted as well, so that you point out tendencies to be avoided rather than just specific instances.

Of course, most human interaction online has none of this, so it would be hard to replicate.

reply