Everyone’s (well, except Anthropic, they seem to have preserved a bit of taste) approach is the more data the better, so the databases of stolen content (erm, models) are memorizing crap.
This was a compromise of the library owners github acccounts apparently, so this is not a related scenario to dangerous code in the training data.
I assume most labs don't do anything to deal with this, and just hope that it gets trained out because better code should be better rewarded in theory?
Yeah, and that won't hold up for long. Just wait until some well resourced attacker replicates their exploit into tens of thousands of sources it knows will be scraped and included in the training set to bias the model to produce their vulnerable code. Only a matter of time.