undefined

[-]

It didn't use web search. But for sure it has some internal knowledge already. It's not a perfect needle in the hay stack problem but gemini flash was much worse when I tested it last time.

by viraptor3 hours ago|

[-]

If you want to really test this, search/replace the names with your own random ones and see if it lists those.

Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...

by ribosometronome2 hours ago|

[-]

Couldn't you just ask the LLM which 50 (or 49) spells appear in the first four Harry Potter books without the data for comparison?

by angst2 minutes ago|

[-]

btw it recalls 42 when i asked. (without web search)

full transcript: pastebin.com/sMcVkuwd

by viraptor2 hours ago|

[-]

It's not going to be as consistent. It may get bored of listing them (you know how you can ask for many examples and get 10 in response?), or omit some minor ones for other reasons.

By replacing the names with something unique, you'll get much more certainty.

by 7 minutes ago|

[-]

deleted

by Grimblewald2 hours ago|

[-]

might not work well, but by navigating to a very harry potter dominant part of latent space by preconditioning on the books you make it more likely to get good results. An example would be taking a base model and prompting "what follows is the book 'X'" it may or may not regurgitate the book correctly. Give it a chunk of the first chapter and let it regurgitate from there and you tend to get fairly faithful recovery, especially for things on gutenberg.

So it might be there, by predcondiditioning latent space to the area of harry potter world, you make it so much more probable that the full spell list is regurgitated from online resources that were also read, while asking naive might get it sometimes, and sometimes not.

the books act like a hypnotic trigger, and may not represent a generalized skill. Hence why replacing with random words would help clarify. if you still get the origional spells, regurgitation confirmed, if it finds the spells, it could be doing what we think. An even better test would be to replace all spell references AND jumble chapters around. This way it cant even "know" where to "look" for the spell names from training.

by 1 hours ago|

[-]

deleted

by heavyset_go18 minutes ago|

[-]

No, because you don't know the magic spell (forgive me) of context that can be used to "unlock" that information if it's stored in the NN.

I mean, you can try, but it won't be a definitive answer as to whether that knowledge truly exists or doesn't exist as it is encoded into the NN. It could take a lot of context from the books themselves to get to it.

by joshmlewis3 hours ago|

[-]

I think the OP was implying that it's probably already baked into its training data. No need to search the web for that.

by 2 hours ago|

[-]

deleted

by obirunda1 hours ago|

[-]

This underestimates how much of the Internet is actually compressed into and is an integral part of the model's weights. Gemini 2.5 can recite the first Harry Potter book verbatim for over 75% of the book.

by soulofmischief3 hours ago|

[-]

The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of harmful.

by Trasmatta1 hours ago|

[-]

Do the same experiment in the Claude web UI. And explicitly turn web searches off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.

by eek21213 hours ago|

[-]

Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.

Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text

Just a few casual thoughts.

I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.

by clhodapp2 hours ago|

[-]

Having AI spew it might suffer from the fact that the spew itself is influenced by AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's context cutoff.

by xiomrze4 hours ago|

[-]

Honest question, how do you know if it's pulling from context vs from memory?

If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.

by petercooper4 hours ago|

[-]

One possible trick could be to search and replace them all with nonsense alternatives then see if it extracts those.

by andai4 hours ago|

[-]

That might actually boost performance since attention pays attention to stuff that stands out. If I make a typo, the models often hyperfixate on it.

by ozim4 hours ago|

[-]

Exactly there was this study where they were trying to make LLM reproduce HP book word for word like giving first sentences and letting it cook.

Basically they managed with some tricks make 99% word for word - tricks were needed to bypass security measures that are there in place for exactly reason to stop people to retrieve training material.

by pron3 hours ago|

[-]

This reminds me of https://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Q... :

> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.

[-]

Do you remember how to get around those tricks?

by djhn3 hours ago|

[-]

This is the paper: https://arxiv.org/abs/2601.02671

Grok and Deepmind IIRC didn’t require tricks.

by eek21213 hours ago|

[-]

This really makes me want to try something similar with content from my own website.

I shut it down a while ago because the number of bots overtake traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue), however, the AI scrapers really started ramping up and the only way I could realistically continue would be to pay a lot more for hosting/infrastructure.

I had put a ton of time into building out content...thousands of hours, only to have scrapers ignore robots, bypass cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.

Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.

I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.

by camdenreslink2 hours ago|

[-]

The new cloudflare products for blocking bots and AI scrapers might be worth a shot if you put so much work into the content.

[-]

When I tried it without web search so only internal knowledge it missed ~15 spells.

by clanker_fluffer4 hours ago|

[-]

What was your prompt?

by huangmeng17 seconds ago|

[-]

you are rich

by meroes4 hours ago|

[-]

What is this supposed to show exactly? Those books have been feed into LLMs for years and there's even likely specific RLHF's on extracting spells from HP.

by muzani3 hours ago|

[-]

There was a time when I put the EA-Nasir text into base64 and asked AI to convert it. Remarkably it identified the correct text but pulled the most popular translation of the text than the one I gave it.

by majewsky1 hours ago|

[-]

Sucks that you got a really shitty response to your prompt. If I were you, the model provider would be receiving my complaint via clay tablet right away.

by rvz4 hours ago|

[-]

> What is this supposed to show exactly?

Nothing.

You can be sure that this was already known in the training data of PDFs, books and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like what the OP just did is meaningless.

Such "benchmarks" are performative to VCs and they do not ask why isn't the research and testing itself done independently but is almost always done by their own in-house researchers.

by zamadatix4 hours ago|

[-]

To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:

> The smug look on Malfoy’s face flickered.

> “No one asked your opinion, you filthy little Mudblood,” he spat.

> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.

> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.

> “Ron! Ron! Are you all right?” squealed Hermione.

> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.

by sobjornstad2 hours ago|

[-]

I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?

In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.

by zamadatix43 minutes ago|

[-]

Hmm, I don't get a hit for "slugulus" or "eructo" (case insensitive) in any of the 7. Interestingly two mentions of "vomit" are in book 6, but neither in reference to to slugs (plenty of Slughorn of course!). Book 5 was the only other one a related hit came up:

> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.

There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.

Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!

by ck_one4 hours ago|

[-]

Then it's fair that id didn't find it

by kybernetikos2 hours ago|

[-]

I recently got junie to code me up an MCP for accessing my calibre library. https://www.npmjs.com/package/access-calibre

My standard test for that was "Who ends up with Bilbo's buttons?"

by muzani3 hours ago|

[-]

There's a benchmark which works similarly but they ask harder questions, also based on books https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

I guess they have to add more questions as these context windows get bigger.

by siwatanejo1 hours ago|

[-]

> All 7 books come to ~1.75M tokens

How do you know? Each word is one token?

by koakuma-chan43 minutes ago|

[-]

You can download the books and run them through a tokenizer. I did that half a year ago and got ~2M.

by dwa35923 hours ago|

[-]

have another LLM (gemini, chatgpt) make up 50 new spells. insert those and test and maybe report here :)

by dom962 hours ago|

[-]

I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?

by bartman3 hours ago|

[-]

Have you by any chance tried this with GPT 4.1 too (also 1M context)?

by LanceJones3 hours ago|

[-]

Assuming this experiment involved isolating the LLM from its training set?

by irishcoffee2 hours ago|

[-]

The top comment is about finding basterized latin words from childrens books. The future is here.

by Geste2 hours ago|

[-]

I'll have some of that coffee too, this is quite a sad time we're living where this is a proper use of our limited resources.

by guluarte4 hours ago|

[-]

you can get the same result just asking opus/gpt, it is probably internalized knowledge from reddit or similar sites.

[-]

If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.

by TheRealPomax2 hours ago|

[-]

That doesn't seem a super useful test for a model that's optimized for programming?

by 4 hours ago|

[-]

deleted

by IhateAI1 hours ago|

[-]

like I often say, these tools are mostly useful for people to do magic tricks on themselves (and to convince C-suites that they can lower pay, and reduce staff if they pay Anthropic half their engineering budget lmao )

by adarsh23213 hours ago|