> how can a2mark ensure that AI did NOT do a clean-room conforming rewrite?

In cases like this it is usually incumbent on the entity claiming a pure clean-room process to show their working. For instance, Compaq's clean-room cloning of the IBM BIOS chip[1] was well documented (the procedures used, records of communications between the teams involved), whereas some other manufacturers did face costly legal trouble from IBM.

So the question is “is the clean-room claim sufficiently backed up to stand up to legal tests?” [and moral tests, though the AI world generally doesn't care about failing those]

--------

[1] the one part of their PCs that was not essentially off-the-shelf, so once it could be reliably and legally mimicked, an open market for IBM PC clones emerged

reply
Turns out there’s no need to speculate. Someone pointed out on GH [0] that the AI was literally prompted to copy the existing code:

> *Context:* The registry maps every supported encoding to its metadata. Era assignments MUST match chardet 6.0.0's `chardet/metadata/charsets.py` at https://raw.githubusercontent.com/chardet/chardet/f0676c0d6a...

> Fetch that file and use it as the authoritative reference for which encodings belong to which era. Do not invent era assignments.

[0] https://github.com/chardet/chardet/issues/327#issuecomment-4...

reply
That's data, not code.
reply
It’s a Python file from chardet 6; it doesn’t matter what you think it does. It clearly wasn’t a clean-room reimplementation.
reply
deleted
reply
The foundation model probably includes the original project in its training set, which might be enough for a court to consider it “contaminated”. Training a new foundation model without it is technically possible, but would take months and cost millions of dollars.
reply
A clean room is sufficient, but not necessary, to avoid accusations of license violation.

a2mark has to demonstrate that v7 is "a work containing v6 or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language", which is different from demanding a clean-room reimplementation.

Theoretically, the existence of a publicly available commit that is half v6 code and half v7 could be used to show that this part of the v7 code has been infected by the LGPL and must thus infect the rest of v7, but that is IMO going against the spirit of the [L]GPL.

reply
Please don't use loaded terms like "infect". The license does not infect, it has provisions and requirements. If you want to interact with it, you either accept them or don't use the project. In this case, the author of v7 is trying to steal the copyrighted work of other authors by re-licensing it illegally.
reply
Is their work present in v7?
reply
Yes. The AI operator posted this as the prompt: https://github.com/chardet/chardet/commit/f51f523506a73f89f0...

which, at minimum, instructs it to directly examine the test suite: `4. High encoding accuracy on the chardet test suite`

reply
So what? Is reading code the same as copying code or modifying existing code?
reply
If you want to prove you did not make a derivative work, then yes, it helps if you never read the source code. Hence so-called "clean room" implementations.
reply
Why should I prove that? Let those who claim the violation prove that.
reply
There is plenty of evidence already. The claim has been substantiated.

You can't just dismiss it and then say the claimant has to provide proof.

reply
Yes. Commits clearly show work in progress where both LGPL and MIT code were present together. This clearly shows it is a derivative work and MUST follow the original license.

Plus, the argument put forth is that they can re-license the project. It's not a new one made from scratch.

reply
Did they eventually remove/replace all the LGPL code?
reply
So, if these commits were private and squashed together before 7.0 was published there would be no violation?
reply
The commits being public or not does not change the fact that the development was done as a derivative work of the original version.
reply
They would be concealing the violation.
reply
Consider the TCC relicensing. They identified the files touched by contributors who wanted to keep the GPL license and reimplemented them. No team-A/team-B clean-room approach was used. The same happened here, but at a different scale. All files now have a new author, and this author is free to change the license of his work.
reply
I think the problem here is that an AI is not a legal entity. It doesn't matter if you, as an individual, run an AI that takes the source and dumps out a spec that you then feed into another AI: the legal liability lies with the operator of the AI. The original copyleft license was granted to a person, not to a robot.

Now if you had two entirely distinct humans involved in the process, that might work, though.

reply