I feel like you're making a logical leap here by assuming that lossiness and failure to reproduce in its entirety imply an inability to recognize. As a trivial example, I can take a sha256 hash of your comment here, lose the ability to reproduce it, but still retain an extremely accurate ability to recognize whether some text is exactly your comment or not. Obviously hashing every substring would not be a particularly efficient strategy, but my point is that saying "it's lossy" isn't particularly compelling without other details.
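The hashing point is easy to demonstrate concretely. A minimal sketch (the stored text is just a stand-in for the comment):

```python
import hashlib

# Keep only the hash of the text: the original is unrecoverable from
# this fingerprint, yet exact-match recognition still works perfectly.
original = "I feel like you're making a logical leap here."
fingerprint = hashlib.sha256(original.encode("utf-8")).hexdigest()

def is_exact_copy(candidate: str) -> bool:
    """True iff candidate is byte-for-byte identical to the original."""
    return hashlib.sha256(candidate.encode("utf-8")).hexdigest() == fingerprint

print(is_exact_copy(original))        # True
print(is_exact_copy(original + "!"))  # False
```

Recognition survives even though reproduction is gone: nothing about `fingerprint` lets you write the comment back out.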
reply
I haven't been following it well but isn't part of the NYT lawsuit against OpenAI that it sometimes spits out NYT articles verbatim?
reply
Genome analysis is also a lossy process that chops the data up into tiny bits, like a newspaper sent through a shredder. We then piece the matching sequences together like a puzzle, and the solution is often fairly inaccurate. So we do it again with a different copy of the newspaper sent through a different shredder. And again. A genome might be assembled from 10x, 30x, or 100x reads, with higher coverage giving higher confidence.

There might be ten million people who have quoted Harry Potter at some point in their blogs or forum posts. There are only so many words in the books.
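The shred-and-reassemble process is easy to simulate. A toy sketch (the text, read length, and coverage are made up, and the greedy assembler is a drastic simplification of real sequencing tools):

```python
import random

def shred(text, read_len=8, coverage=30):
    """Simulate shotgun sequencing: many random overlapping fragments."""
    n_reads = coverage * len(text) // read_len
    return [text[i:i + read_len]
            for i in (random.randrange(len(text) - read_len + 1)
                      for _ in range(n_reads))]

def merge(a, b, min_overlap=4):
    """Glue b onto the end of a if they overlap by >= min_overlap chars."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

def assemble(reads):
    """Greedily merge overlapping reads into contigs; return the longest."""
    contigs = list(dict.fromkeys(reads))  # dedupe, keep order
    merged = True
    while merged and len(contigs) > 1:
        merged = False
        for i in range(len(contigs)):
            for j in range(len(contigs)):
                if i != j and (m := merge(contigs[i], contigs[j])):
                    contigs[i] = m
                    del contigs[j]
                    merged = True
                    break
            if merged:
                break
    return max(contigs, key=len)

genome = "abcdefghijklmnopqrstuvwxyz0123456789"
print(assemble(shred(genome)))  # usually the full text, given enough coverage
```

With low coverage you get gaps and short contigs; crank the coverage up and the full "newspaper" usually comes back, which is exactly the redundancy argument above.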

reply
Study: Meta AI model can reproduce almost half of Harry Potter book

https://arstechnica.com/features/2025/06/study-metas-llama-3...

reply
See also GEMA vs. OpenAI.
reply
It is lossy, but it is still enough for verbatim recreation. All of Wikipedia is only about 24 GB of losslessly compressed text, and all of J.K. Rowling's work fits into a few MB, so this material could easily be stored verbatim in a trillion-parameter model. Reasoning about the training cutoff is also something the newest models do pretty well, because you can teach them to do so after pre-training, e.g. with SFT. With tool use the model can then even check actual current sources, which may happen without you even knowing in the normal chat apps unless you use a controlled API call.
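The capacity claim is just back-of-envelope arithmetic; a quick sketch (the sizes are the rough figures above, and 2 bytes/parameter assumes fp16/bf16 weights):

```python
# Rough figures: compressed English Wikipedia text, and a trillion-parameter
# model stored at 2 bytes per parameter (fp16/bf16).
wikipedia_compressed_gb = 24
params = 1_000_000_000_000
bytes_per_param = 2

model_bytes = params * bytes_per_param
print(model_bytes / 1e9)                               # 2000.0 GB of raw weights
print(model_bytes / (wikipedia_compressed_gb * 1e9))   # ~83x compressed Wikipedia
```

So even if only a tiny fraction of the weights effectively memorized text, there would be room for many copies of Wikipedia, let alone a few MB of novels.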
reply
How do you know how the model works? If there were an index of all of Micken's writings, or even if the model searched the web before feeding the response to you, you wouldn't be able to tell by observing from the outside.
reply
I suppose a quick test would be getting the model to write down Micken's essay end to end.

If the original essay were stuffed into the prompt window, the result would be word-accurate.

Unless this is a model trained specifically on Micken's essay (which Claude is not).

reply
This seems like a classic case where succeeding proves it can happen, but failing is insufficient proof that it's impossible. I don't think there's a "quick test" that rules out the existence of a more effective prompt that would reproduce the essay more faithfully.
reply
That's in the ideal scenario where the model has only seen a single copy of it, though.
reply
Haven’t there been repeated experiments that show if you jailbreak most frontier models’ harnesses you can get them to output near verbatim copyrighted works?

I swear there was a whole court case about this in the last year.

reply