> This was a clean-room implementation

This is really pushing it, considering it's trained on… the internet, with all available C compilers. The work is already impressive enough, no need for such misleading statements.

reply
It's not a clean-room implementation, but not because it's trained on the internet.

It's not a clean-room implementation because of this:

> The fix was to use GCC as an online known-good compiler oracle to compare against

reply
The classical definition of a clean room implementation is something that's made by looking at the output of a prior implementation but not at the source.

I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.
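
To make "programmatic checker" concrete: "GCC as an oracle" presumably boils down to differential testing - compile the same program with the reference compiler and with the compiler under test, run both, and compare observable behavior. Roughly something like this sketch, where mycc is a made-up name for the compiler being built, not anything from the article:

    import subprocess

    def differential_test(c_source: str) -> bool:
        # Build the same program with the known-good compiler and the one under test.
        subprocess.run(["gcc", c_source, "-o", "ref_bin"], check=True)
        subprocess.run(["./mycc", c_source, "-o", "test_bin"], check=True)
        # Compare observable behavior (exit code + stdout), not the emitted binaries.
        ref = subprocess.run(["./ref_bin"], capture_output=True)
        out = subprocess.run(["./test_bin"], capture_output=True)
        return (ref.returncode, ref.stdout) == (out.returncode, out.stdout)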

reply
If you read the entire GCC source code and then create a compatible compiler, it's not clean room. Which Opus basically did since, I'm assuming, its training set contained the entire source of GCC. So even if they were actively referencing GCC I think that counts.
reply
What if you just read the entire GCC source code in school 15 years ago? Is that not clean room?
reply
No.

I'd argue that no one would really care given it's GCC.

But if you worked for GiantSodaCo on their secret recipe under NDA, then create a new soda company 15 years later that tastes suspiciously similar to GiantSodaCo, you'd probably have legal issues. It would be hard to argue that you weren't using proprietary knowledge in that case.

reply
I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.

It's anything but a clean-room design. A clean-room design is a very well-defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."

https://en.wikipedia.org/wiki/Clean-room_design

The phrase "without infringing any of the copyrights" contains "any".

We know for a fact that models are extremely good at storing information, with the highest compression rate ever achieved. The fact that they typically decompress that information in a lossy way doesn't mean they didn't use that information in the first place.

Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there is information being stored.

It's not a clean-room design, plain and simple.

reply
With just a few thousand dollars of API credits you too can inefficiently download a lossy copy of a C compiler!
reply
The LLM does not contain a verbatim copy of everything it saw during the pre-training stage. It may remember certain over-represented parts; otherwise it has knowledge about a lot of things, but that knowledge, while covering a huge number of topics, is similar to the way you remember things you know very well. And, indeed, if you give it access to the internet or to the source code of GCC and other compilers, it will implement such a project N times faster.
reply
We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

It is a research topic for heaven's sake:

https://arxiv.org/abs/2504.16046

reply
The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
reply
This seems related, it may not be a codebase but they are able to extract "near" verbatim books out of Claude Sonnet.

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

reply
Their technique really stretched the definition of extracting text from the LLM.

They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.
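
As I read it, the measurement is roughly: feed the model a chunk of the real text as a prefix, ask it to continue, and score how much of the true next passage comes back. Something like this sketch - complete() here is a made-up stand-in for whatever API the authors actually drove, not their code:

    def extraction_score(book_text, complete, prefix_len=500, target_len=500):
        # Prompt with actual text from the book...
        prefix = book_text[:prefix_len]
        # ...and compare the model's continuation against the real next passage.
        target = book_text[prefix_len:prefix_len + target_len]
        continuation = complete(prefix)
        # Count positions where the continuation matches the original, character for character.
        matched = sum(1 for a, b in zip(continuation, target) if a == b)
        return matched / target_len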

reply
> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

The lesson here is that the Internet compresses pretty well.

reply
(I'm not needlessly nitpicking, as I think it matters for this discussion)

A frontier model (e.g. the latest Gemini or GPT) is likely several-to-many times larger than 500GB. Even DeepSeek V3 was around 700GB.

But your overall point still stands, regardless.

reply
You got a source on frontier models being maybe half a terabyte? That's not passing the sniff test.
reply
We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.

The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

reply
We are here in a clean room implementation thread, and verbatim copies of entire works are irrelevant to that topic.

It is enough to have read even parts of a work for something to be considered a derivative.

I would also argue that language models that need gargantuan amounts of training material in order to work can, by definition, only output derivative works.

It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.

reply
> It is enough to have read even parts of a work for something to be considered a derivative.

For IP rights, I'll buy that. Not as important when the question is capabilities.

> I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

For similar reasons, I'm not going to argue against anyone saying that all machine learning today doesn't count as "intelligent":

It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.

ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.

reply
Granted, these are some of the most widely spread texts, but just fyi:

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

reply
Already aware of that work, that's why I phrased it the way I did :)

Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.

reply
Besides, the fact that an LLM may recall parts of certain documents, like I can recall the incipits of certain novels, does not mean that when you ask the LLM to do some other kind of work, one that is not about recalling stuff, it will mix such things in verbatim. The LLM knows what it is doing in a variety of contexts, and uses that knowledge to produce stuff. The fact that for many people it is bitter that LLMs are able to do things that replace humans does not mean (and it is not true) that this happens mainly through memorization. What coding agents can do today has no explanation in the memorization of verbatim stuff. So it's not a matter of copyright. Certain folks are fighting the wrong battle.
reply
During a "clean room" implementation, the implementor is generally selected for not being familiar with the workings of what they're implementing, and banned from researching using it.

Because it _has_ been enough, historically, that if you can recall things, your implementation ends up not being "clean room" and gets trashed by the lawyers who get involved.

I mean... It's in the name.

> The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

If it can recall... Then it is not a clean room implementation. Fin.

reply
While I mostly agree with you, it's worth noting that modern LLMs are trained on 10-30T tokens, which is quite comparable to their size (especially given how compressible the data is).
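
Back-of-the-envelope, with ballpark numbers that are my own assumptions rather than anything published:

    params = 1e12              # assume ~1T parameters for a large frontier model
    bytes_per_param = 1        # assume roughly 8-bit weights
    model_bytes = params * bytes_per_param        # ~1 TB of weights

    train_tokens = 15e12       # assume ~15T training tokens
    bytes_per_token = 4        # roughly 4 bytes of raw text per token
    train_bytes = train_tokens * bytes_per_token  # ~60 TB of training text

    print(train_bytes / model_bytes)  # ~60x - same order of magnitude, not millions-to-one
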
reply
Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.

reply
The point is that it's a probabilistic knowledge manifold, not a database.
reply
We all know that.
reply
Unfortunately, that doesn't seem to be the case. The person I replied to might not understand this, either.
reply
So it will copy most code while adding subtle bugs
reply
There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, that there have been continuous improvements for many years now, and that there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.
reply
The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.
reply
I have to admit, even if model and tooling progress stopped dead today, the world of software development has forever changed and will never go back.
reply
Yea the software engineering profession is over, even if all improvements stop now.
reply
Every S-curve looks like an exponential until you hit the bend.
reply
We've been hearing this for 3 years now. And especially 25 was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at more broad tasks, and more useful by the month.
reply
Yes, and Moore's law took decades to start failing. Three years of history isn't even close to enough to predict whether we'll see exponential improvement or an insurmountable plateau. We could hit it in 6 months or 10 years, who knows.

And at least with Moore's law, we had some understanding of the physical realities as transistors got smaller and smaller, and could reasonably predict when we'd start to hit limitations. With LLMs, we just have no idea. And that could go either way.

reply
> We've been hearing this for 3 years now

Not from me you haven't!

> "they've hit a wall, no more data, running out of data, plateau this, saturated that"

Everyone thought Moore's Law was infallible too, right until they hit that bend. What hubris to think these AI models are different!

But you've probably been hearing that for 3 years too (though not from me).

> Models keep on getting better, at more broad tasks, and more useful by the month.

If you say so, I'll take your word for it.

reply
25 is 2025.
reply
Oh my bad, the way it was worded made me read it as the name of somebody's model or something.
reply
Except that with Moore's law, everyone knew decades in advance what the limits of Dennard scaling were (shrinking geometry through smaller optical feature sizes), and roughly when we would hit them.

Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

reply
> Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

Idk, that sounds remarkably similar to these AI models to me.

reply
> And yet, here we are.

I dunno. To me it doesn’t even look exponential any more. We are at most on the straight part of the incline.

reply
This quote would be more impactful if people hadn't been repeating it since GPT-4 times.
reply
People have also been saying we'd be seeing the results of 100x quality improvements in software, with a corresponding decrease in cost, since GPT-4 times.

So where is that?

reply
I agree, I have been informed that people have been repeating it for three years. Sadly I'm not involved in the AI hype bubble so I wasn't aware. What an embarrassing faux pas.
reply
The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

Prove this statement wrong.

reply
Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two-sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely, except that LLMs use knowledge acquired during training, which we all agree on here.

reply
The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough, programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests acting as an explicit filter on those outputs, limiting the acceptable solution space and excluding the unwanted interpolations of the training set that also result from the lossy input compression.

The fact that the implementation language for the compiler is Rust doesn't factor into this. ML-based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM, and the keyword Rust in the input activates one specific to that programming language.

reply
"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.
reply
Suppose you, the human, are working on a clean-room implementation of a C compiler; how do you go about doing it? Will you need to know about a) the C language, and b) the inner workings of a compiler? How did you acquire that knowledge?
reply
> Prove this statement wrong.

If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

reply
Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do, because most are trained on that quite small project - it's 20 KB.

But reimplementing that isn't impressive, because it's not a clean-room implementation if you trained on that data to make the model that regurgitates the effort.

reply
> Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do.

Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.

reply
This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.
reply
Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?
reply
Please look at the source code and tell me how this is a "simple, basic LLM task".
reply
Perhaps 4.5 could also do it? We don't really know until we try. I don't trust the marketing material that much. The fact that previous (smaller) versions could or couldn't do it does not really disprove that claim.
reply
Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
reply
This sounds very wrong to me.

Take the C4 training dataset, for example. The uncompressed, uncleaned size of the dataset is ~6TB, and it contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

reply
This would imply that the English internet is not much bigger than 20x the English Wikipedia.

That seems implausible.

reply
> That seems implausible.

Why, exactly?

Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.

reply
A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.
reply
I would be extremely surprised if it was that small.
reply
This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.
reply
I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

Yes, the compression and storage happened during the training. Before, it still didn't work; now it does much better.

reply
The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.

If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases and doesn't generate false positives - again, without internet access and without using GCC as an oracle, etc. - now that would be something.

reply
No one needs to prove you wrong. That's just personal insecurity trying to justify one's own worth.
reply
> clean-room implementation

Except it's trained on all the source code out there, so I assume on GCC and Clang. I wonder how similar the code is to either.

reply
Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.

The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.

Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.

reply
> Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.

Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.

https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.

reply
Are you a frequent user of coding agents?

I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.

I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.

reply
I am an all-day, daily user (multiple Claude Max accounts). This fits my mental model - not the model I had before, but the one I developed with daily use. My job revolves around two core things:

1. data analysis / visualization / …

2. "is this possible? can this even be done?"

For #1 I don't do much by hand anymore; for #2 I still mostly do it all "by hand", and not for lack of serious trying. So "it can do #1 1000x better than me because those are generally solved problems it was trained on, while it can't effectively do #2" fits perfectly.

reply
> Claude did not have internet access at any point during its development

Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.

reply
It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".
reply