But that's not really what danlitt said, right? They did not claim that it's impossible for an LLM to generate something different, merely that it's not a clean-room implementation, since the LLM, one must assume, is trained on the code it's re-implementing.
reply
But an LLM has seen millions (?) of other code-bases too. If you give it a functional spec, it has no reason to prefer any one of those code-bases in particular. Except perhaps if it has seen the original spec (assuming one can be read from public sources) associated with the old implementation, and the new spec is a copy of the old spec.
reply
Yes. If you are solving the exact problem that the original code solved, and that original code was labeled as solving that exact problem, then that's a very good reason for the LLM to produce that code.

Researchers have shown that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.

reply
> that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.

Kinda weird argument: in their research (https://forum.gnoppix.org/t/researchers-extract-up-to-96-of-...) the LLM was explicitly asked to reproduce the book. There are people out there who can do so without LLMs; by this logic, everything they write is a copyright infringement, and so is every book they can reproduce.

> Yes if you are solving the exact problem that the original code solved and that original code was labeled as solving that exact problem then that’s very good reason for the LLM to produce that code.

I think you're overestimating LLM ability to generalize.

reply
I guess the text of Harry Potter was used as training material in one big chunk. That would be a copyright violation.
reply
This is not an argument against coding in a different language, though. It would be like having it restate Harry Potter in a different language with different main character names, and reshuffled plot points.
reply
Well, if you're coding it in Zig, and the LLM has barely seen any Zig, then how exactly would that argument hold up?
reply
By what means did you make sure your LLM was not trained with data from the original source code?
reply
Exactly - it very likely was trained on it. I tried this with Opus 4.6. I turned off web searches and other tool calls, and asked it to list some filenames it remembers being in the 7-zip repo. It got dozens exactly right and only two incorrect (they were close but not exact matches). I then asked it to give me the source code of a function I picked randomly, and it got the signature spot on, but not the contents.

My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.

reply
Because it's written in an entirely different language, which makes this whole point moot.
reply
Surely if I took a program written in Python and translated it line for line into JavaScript, that wouldn't allow me to treat it as original work. I don't see how this solves the problem, except very incrementally.
reply
But it's not a line-for-line translation. It's a functionality-for-functionality translation, and sometimes a very different one.
reply
I only said the probability is higher, not that the probability is 1!
reply