I guess it depends on whether the source data set is part of the training data or not (if it's open source, it likely is).

A lawyer could easily argue that the model itself stores a representation of the original, and thus it can never do a "fresh context".

And to be perfectly honest, LLMs can quote a lot of text verbatim.

reply
The new agent writing the code probably has at least parts of the original code in its training data.

We can't speak of a clean-room implementation from an LLM, since LLMs are technically capable only of recombining their training data in different ways, not of any original creation.

reply
The conclusion of this would be that you can never license AI-generated code, since you can't get a release from the original authors.

Of course, in practice it would work in exactly the opposite fashion, and AI-generated code would be immune even if it copied code verbatim.

reply
I don't see what's wrong with that personally. If I pirated someone's software, sold it as my own, and got caught, the fact that I sold a lot of copies doesn't put the people who bought them in the clear. They are still using bootleg software in their business.
reply
Only in the case of open source code
reply
How do you prove the training data didn't contain the code?

I'd assume an LLM trained on the original would also be contaminated.
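For what it's worth, verbatim overlap of the kind mentioned above can at least be measured mechanically. Here's a minimal sketch of a token n-gram overlap check; all the names are illustrative, this isn't any real contamination-detection tool:

```python
# Hypothetical sketch: estimate how much of a generated snippet
# reproduces a known original verbatim, via shared token n-grams.

def ngrams(tokens, n=8):
    """Return the set of n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate: str, original: str, n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also occur in the original.

    A high ratio suggests verbatim copying; a low ratio proves nothing,
    since contamination can survive paraphrasing.
    """
    cand = ngrams(candidate.split(), n)
    orig = ngrams(original.split(), n)
    if not cand:
        return 0.0
    return len(cand & orig) / len(cand)

original = "int add ( int a , int b ) { return a + b ; }"
copied   = "int add ( int a , int b ) { return a + b ; }"
fresh    = "def add ( a , b ) : return a + b"

print(overlap_ratio(copied, original))  # 1.0, fully verbatim
print(overlap_ratio(fresh, original))   # 0.0, no shared 8-grams
```

This only catches literal copying, which is exactly the asymmetry the thread points out: proving the negative (that the training data did *not* contain the code) is much harder than spotting the positive.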

reply