But that's not really what danlitt said, right? They did not claim that it's impossible for an LLM to generate something different, merely that it's not a clean-room implementation, since the LLM, one must assume, is trained on the code it's re-implementing.
reply
But an LLM has seen millions (?) of other code-bases too. If you give it a functional spec, it has no reason to prefer any one of those code-bases in particular. Except perhaps if it has seen the original spec (assuming one can be read from public sources) associated with the old implementation, and the new spec is a copy of the old spec.
reply
Yes. If you are solving the exact problem that the original code solved, and that original code was labeled as solving that exact problem, then that's a very good reason for the LLM to produce that code.

Researchers have shown that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.

reply
> that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.

Kinda weird argument: in their research (https://forum.gnoppix.org/t/researchers-extract-up-to-96-of-...) the LLM was explicitly asked to reproduce the book. There are people out there who can do so without LLMs; by this logic, everything they write is a copyright infringement, and so is every book they can reproduce.

> Yes if you are solving the exact problem that the original code solved and that original code was labeled as solving that exact problem then that’s very good reason for the LLM to produce that code.

I think you're overestimating LLM ability to generalize.

reply
I guess the text of Harry Potter was used as training material in one big chunk. That would be a copyright violation.
reply
This is not an argument against coding in a different language, though. It would be like having it restate Harry Potter in a different language with different main character names, and reshuffled plot points.
reply
Well, if you're coding it in Zig, and the LLM has barely seen any Zig, then how exactly would that argument hold up?
reply
By what means did you make sure your LLM was not trained with data from the original source code?
reply
Exactly - it very likely was trained on it. I tried this with Opus 4.6. I turned off web searches and other tool calls, and asked it to list some filenames it remembers being in the 7-zip repo. It got dozens exactly right and only two incorrect (they were close but not exact matches). I then asked it to give me the source code of a function I picked randomly, and it got the signature spot on, but not the contents.

My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.

reply
Because it's written in an entirely different language, which makes this whole point moot.
reply
Surely if I took a program written in Python and translated it line for line into JavaScript, that wouldn't allow me to treat it as original work. I don't see how this solves the problem, except very incrementally.
reply
But it's not a line-for-line translation. It's a functionality-for-functionality translation, and sometimes a very different one.
reply
I only said the probability is higher, not that the probability is 1!
reply