> By design you can't know if the LLM doing the rewrite was exposed to the original code base.

I agree, in theory. In practice, courts will request that the decision-making process be made public. The "we don't know" excuse won't hold; real people also have to tell the truth in court. LLMs may not lie to the court or use the Chewbacca defence.

Also, I am pretty certain you CAN have AI models that explain how they arrived at their decisions. And they can generate valid code too, so in theory anything here can be autogenerated.

reply
I don't see how this is different from current human poaching practices. It appears to be legal today to hire an employee from company A who has been "tainted" by company A's [proprietary AI secrets/proprietary CPU architecture secrets/etc] in order to develop a competing offering for company B. It's not illegal for a human who worked at Intel for 20 years to go work for AMD, even though they are certainly "tainted" with all sorts of copyrighted/proprietary knowledge that will surely leak through at AMD. Maybe patents are a first line of defense for company A, but they can't prevent adjacent solutions that aren't outright duplications and that circumvent the patent.
reply
Seeing the source for a project doesn't prevent me from ever creating a similar project. The devil is in the details.
reply
Agreed, but the courts could conclude that any LLM that is not open about its decision-making has stolen things. So LLMs would auto-lose in court.
reply
Or they can conclude otherwise.
reply
It was exposed when it was shown the thing to rewrite.
reply
In this context, I think that is a correct statement. But I think you can have LLMs that generate the same or similar code without having been exposed to the other code.
reply
It doesn't even matter if the LLM was exposed during training. A clean-room rewrite can be done by having one LLM create a highly detailed analysis of the target (reverse engineering it, if it's in binary form), and providing that analysis to another LLM on which to base an implementation.
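To make the split concrete, here's a minimal sketch of that two-stage pipeline. The `analyze` and `implement` functions are hypothetical stand-ins for calls to two separately deployed LLMs (no real LLM API is assumed); the point is only the information barrier: the implementer sees the spec, never the original source.

```python
# Hypothetical two-stage "clean-room" pipeline. analyze() and implement()
# are stubs standing in for two separate LLMs; names and logic are
# illustrative assumptions, not a real API.

def analyze(original_source: str) -> str:
    """Stage 1 (LLM A): read the encumbered source, emit only a spec."""
    # A real system would prompt LLM A here. Stub: a behavioral spec.
    return "SPEC: function add(a, b) returns the sum of its two arguments"

def implement(spec: str) -> str:
    """Stage 2 (LLM B): sees ONLY the spec, never the original code."""
    # Guard: refuse anything that looks like leaked source code.
    assert spec.startswith("SPEC:") and "def " not in spec
    # A real system would prompt LLM B here. Stub: a fresh implementation.
    return "def add(a, b):\n    return a + b\n"

original = "def add(x, y):  # original, license-encumbered source\n    return x + y\n"
spec = analyze(original)
reimplementation = implement(spec)
print(reimplementation)
```

Whether a court would accept the spec as a sufficient firewall is exactly the open question in this thread, but structurally it mirrors the human clean-room process: the only artifact crossing the wall is the written analysis.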
reply
It doesn't matter for the LLM writing the analysis.

It does matter for the one who implements it.

Finding an LLM that's good enough to do the rewrite while being able to prove it wasn't exposed to the original GPL code is probably impossible.

reply
Why does it need 2 LLMs? LLMs aren't people. I'm not even sure that it needs to be done in 2 separate contexts.
reply
It doesn't have to be 2 LLMs, but nowadays there's LLM auto-memory, which means it could be argued that the same LLM doing both the analysis and the reimplementation isn't "clean". And the entire purpose of the "clean" part is to avoid that argument.
reply
Agreed. But even then I don't see the problem. Multiple LLMs could work on the same project.
reply
Is it against the law for an LLM to read LGPL-licensed code?

That’s a complex question that isn’t solved yet. Clearly, regurgitating verbatim LGPL code in large chunks would be unlawful. What’s much less clear is a) how large do those chunks need to be to trigger LGPL violations? A single line? Two? A function? What if it’s trivial? And b) are all outputs of a system which has received LGPL code as an input necessarily derivative?

If I learn how to code in Python exclusively by reading LGPL code, and then go away and write something new, it's clear that I haven't committed any violation of copyright under existing law, even if all I'm doing as a human is rearranging tokens I learned from reading LGPL code to achieve a new result.

It’s a trying time for software and the legal system. I don’t have the answers, but whether you like them or not, these systems are here to stay, and we need to learn how to live with them.

reply