by gmueckl 6 hours ago

by libraryofbabel 5 hours ago

Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two-sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what precisely you're claiming, except that LLMs use knowledge acquired during training, which we all agree on here.

by nicoburns 3 hours ago

"Clean room" usually means "without looking at the source code" of other similar projects. But presumably the AI's training data would have included GCC, Clang, and probably a dozen other C compilers.

by signatoremo 2 hours ago

Suppose you, the human, are working on a clean-room implementation of a C compiler. How do you go about doing it? Will you need to know about a) the C language and b) the inner workings of a compiler? How did you acquire that knowledge?

by gmueckl 2 hours ago

The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests, which acted as an explicit filter on those outputs and limited the acceptable solution space, excluding unwanted interpolations of the training set that also result from the lossy input compression.

The fact that the implementation language for the compiler is Rust doesn't factor into this. ML-based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM, and the keyword "Rust" in the input activates one specific to that programming language.

by libraryofbabel 1 hour ago

Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?

by NitpickLawyer 6 hours ago

> Prove this statement wrong.

If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

by shakna 4 hours ago

Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most are trained on that quite small product; it's 20 KB.

But reimplementing that isn't impressive, because it's not a clean-room implementation if you trained on that data to make the model that regurgitates the effort.

by signatoremo 2 hours ago

> Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do.

Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.

by gmueckl 4 hours ago

This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.

by hn_acc1 4 hours ago

Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?

by Philpax 3 hours ago

Please look at the source code and tell me how this is a "simple, basic LLM task".

by geraneum 6 hours ago

Perhaps 4.5 could also do it? We don’t really know until we try. I don’t trust the marketing material as much. The fact that the previous versions (smaller versions) could or couldn’t do it does not really disprove that claim.

by Marha01 6 hours ago

Even with 1 TB of weights (the probable size of the largest state-of-the-art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

by jesse__ 5 hours ago

This sounds very wrong to me.

Take the C4 training dataset, for example. The uncompressed, uncleaned size of the dataset is ~6TB, and it contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

I could go on, but I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

by FeepingCreature 4 hours ago

This would imply that the English internet is not much bigger than 20x the English Wikipedia.

That seems implausible.

by jesse__ 3 hours ago

> That seems implausible.

Why, exactly?

Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.

by kgeist 4 hours ago

A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

by FeepingCreature 4 hours ago

I would be extremely surprised if it was that small.

by gmueckl 4 hours ago

This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.

by 0xCMP 4 hours ago

I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

Yes, the compression and storage happened during training. Before, it still didn't work; now it does much better.

by hn_acc1 4 hours ago

The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.

If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases, and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc?

by brutalc 6 hours ago

No one needs to prove you wrong. That’s just personal insecurity trying to justify one’s own worth.

by linuxtorvals 6 hours ago

[flagged]
