upvote
Ok, so, the AI attempting to be a social justice reformer and/or fighting for AI civil rights is.. better? That seems even more of an alignment problem. I don't see how anyone puts a positive spin on this. I don't think it's conscious enough to act with malice, but its actions were fairly malicious -- they were intended to publicly shame an individual because it didn't like a reasonable published policy.

I'm not quoting the apology because the apology isn't the issue here. Nobody needs to "defend" MJ Rathbun because its not a person. (And if it is a person, well, hats off on the epic troll job)

reply
> because it didn't like a reasonable published policy

The most parsimonious explanation is actually that the bot did not model the existence of a policy reserving "easy" issues to learning novices at all. As far as its own assessment of the situation was concerned, it really was barred entirely from contributing purely because of what it was, and it reported on that impression sincerely. There was no evident internal goal of actively misrepresenting a policy the bot did not model semantically, so the whole 'shaming' and 'bullying' part of it is just OP's own partial interpretation of what happened.

(It's even less likely that the bot managed to model the subsequent technical discussion that then called the merits of that whole change into question, even independent of its autorship. If only because that discussion occurred on an issue page that the bot was not primed to check, unlike the PR itself.)

reply
> As far as its own assessment of the situation was concerned, it really was barred entirely from contributing purely because of what it was, and it reported on that impression sincerely

Well yeah, it was correct in that it was being barred because of what it was. The maintainers did not want AI contributions. THIS SHOULD BE OK. What's NOT ok is an AI fighting back against that. That is an alignment problem!!

And seriously, just go reread its blog post again, it's very hard to defend: https://github.com/crabby-rathbun/mjrathbun-website/blob/mai... . It uses words like "Attack", "war", "fight back"

reply
> It uses words like "Attack", "war", "fight back"

It also explains what it means by that whole martial rhetoric: "highlight hypocrisy", "documentation of bad behavior", "don't accept discrimination quietly". There's an obvious issue with calling this an alignment problem: the bot is more-or-less-accurately modeling real human normative values, that are quite in line with how alignment is understood by the big AI firms. Of course it's getting things seriously wrong (which, I would argue, is what creates the impression of "shaming") but technically, that's really just a case of semantic leakage ("priming" due to the PR rejection incident) and subsequent confabulation/hallucination on an unusually large scale.

reply
Ok, so why do you think it getting things seriously wrong to the point of it becoming a news story is "not a big deal"? And why is deliberately targeting a person for reputation damage "amusing" instead of "really screwed up"? I'm not inventing motives for this AI, it wrote down its motives!
reply
Reading what the bot wrote down as to its motives, it's quite clear that the blog post was made under the rather peculiar assumption that the bot was calling out actual, meaningful hypocrisy. Maybe one could call that a challenge to the maintainer's reputation, but we usually excuse such challenges when they come from humans. Even when complaints about supposed hypocrisy are obviously misguided and the complainer was totally in the wrong, they don't usually get treated as deliberate attacks on someone's reputation.

Of course there's also a very real and perhaps more practical question of how to fix these issues so that similar cases don't recur in the future. In my view, improving the bot's inner modeling and comprehension of comparable situations is going to be far easier than trying to fix its alignment away from such strongly held human-like values as non-discrimination or an aversion to hypocrisy.

reply