I’m not convinced that Grok’s dataset must contain CSAM for it to generate CSAM. Surely a combination of nude adults and clothed children would allow it to synthesize CSAM?

(Disclaimer: I’m not in favor of AI in general, and definitely not in favor of what Grok is doing specifically. I’m just not entirely sold on the claim that its dataset must contain CSAM, though it probably does contain at least some, because cleaning up such a massive dataset carefully and thoroughly costs money that Elon wouldn’t want to spend.)

reply
Agreed that this makes it unlikely we’ll see frontier training data open-sourced, but that’s a separate problem from software and infrastructure transparency, which has none of those constraints. The training stack, the parallelism decisions, and documented failure modes are engineering knowledge, and there’s no principled reason they can’t ship.
reply
The human labor aspect is rarely discussed, yet it is essential and, I am sure, very abusive.

People think of these models as "magic" and "science," but they don’t realize the immense amount of labor (in human-years) spent clicking yes/no on thousands of input/output pairs.

I worked for some months as a Google Quality Rater (wow), so I know the job. This must be much worse.

reply
I agree that full transparency on data adds several other challenges. Still, even releasing the software and infrastructure aspects would be a huge step from where we are now. Also, some recent work has shown pretraining filtering to be possible and beneficial, which could help mitigate some concerns about sensitive data in the datasets.
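To make that last point concrete, here's a minimal sketch of what pretraining filtering can look like. Everything in it is illustrative: the blocklist terms, the `unsafe_score` (standing in for a trained safety classifier's output), and the threshold are all placeholders, not any lab's actual pipeline.

```python
# Minimal sketch of pretraining-data filtering: a hard keyword rule
# plus a soft score-based rule. Real pipelines use trained safety
# classifiers; the blocklist and threshold here are placeholders.

BLOCKLIST = {"sensitive_term_a", "sensitive_term_b"}  # hypothetical terms

def keep_document(text: str, unsafe_score: float, threshold: float = 0.5) -> bool:
    """Return True if the document should stay in the training set."""
    tokens = set(text.lower().split())
    if tokens & BLOCKLIST:           # hard rule: any blocklisted token drops the doc
        return False
    return unsafe_score < threshold  # soft rule: classifier-estimated risk

def filter_corpus(docs):
    """docs: iterable of (text, unsafe_score) pairs -> list of kept texts."""
    return [text for text, score in docs if keep_document(text, score)]

corpus = [
    ("a perfectly ordinary document", 0.1),
    ("contains sensitive_term_a somewhere", 0.1),  # dropped by blocklist
    ("borderline document", 0.9),                  # dropped by score
]
print(filter_corpus(corpus))  # -> ['a perfectly ordinary document']
```

Even a crude filter like this can be run before training, which is the point: the concern about sensitive data is at least partially addressable upstream of the model.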
reply