Isn't that what https://github.com/uutils/coreutils is? The GNU coreutils spec and test suite, used to produce a Rust MIT-licensed implementation. (Granted, by humans, AFAIK.)
1. Generate a specification of what the system does.
2. Pass it to another "clean" system.
3. The second, clean system implements based solely on the specification, with no information about the original.
That third step is the hardest, especially for well-known projects.
Then the model that is familiar with the code can write the specs, and the model that has no knowledge of the project can implement them.
Would that be a proper clean room implementation?
Seems like a pretty evil but profitable product: "rewrite any code base with an inconvenient license into your proprietary version, legally".
2. Dumped into a file.
3. claude-code then converts this to tests in the target language and implements the app that passes those tests.
Step 3 is no longer hard: look at all the reimplementations popping up, from ccc to the various rewrites. They all share a well-defined test suite as a common theme, so much so that the tldraw author raised a (joking) issue asking to remove the tests from the project.
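A minimal sketch of step 3, assuming the recorded behavior from step 2 has been dumped as plain input/output pairs. The `RECORDED` table and `word_count` function here are hypothetical stand-ins, not from any real project:

```python
# Hypothetical behavior dumped from the original tool (step 2):
# each entry maps an observed input to the observed output.
RECORDED = {
    "hello world": 2,
    "": 0,
    "one  two   three": 3,
}

def word_count(text: str) -> int:
    """Clean-room reimplementation, written only against RECORDED,
    never against the original source."""
    return len(text.split())

# Step 3: the recorded pairs become the test suite in the target language.
for given, expected in RECORDED.items():
    assert word_count(given) == expected, (given, expected)
```

The reimplementer only ever sees the table, which is the whole point of the exercise.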
AI muddies the water because large models trained on public repos can reproduce GPL snippets verbatim, so prompting with tests that mirror the original risks contamination, and a court could find substantial similarity. To reduce the risk: use black-box fuzzing and property-based tools, have humans review and scrub model outputs, run similarity scans, and budget for legal review before calling anything MIT.
Our knowledge of what the person or the model actually retains of the original source is fundamentally incomplete, yet the entire premise requires full knowledge that nothing remains.
The thesis I propose is that tests are more akin to facts, or can be stated as facts, and facts are not copyrightable. That's what makes this case interesting.
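One way to read that: a property test states a fact about behavior ("the output of sort is ordered and is a permutation of the input") rather than copying any expression from the original. A minimal sketch, with `my_sort` as a hypothetical reimplementation under test:

```python
import random
from collections import Counter

def my_sort(xs):
    """Hypothetical reimplementation under test."""
    return sorted(xs)

random.seed(1)
for _ in range(200):
    xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    out = my_sort(xs)
    # Fact 1: the output is ordered.
    assert all(a <= b for a, b in zip(out, out[1:]))
    # Fact 2: the output is a permutation of the input.
    assert Counter(out) == Counter(xs)
```

Nothing in those two assertions depends on how the original implemented sorting; they are statements about what any correct sort must do.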
If "tests" means a proper specification, say an IETF RFC for a protocol, then that would be a different matter.