One possible trick: search and replace them all with nonsense alternatives, then see whether it extracts those instead.
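A minimal sketch of that search-and-replace step, assuming the terms are the spell names discussed in this thread. The `swap_terms` helper and the nonsense words in the mapping are hypothetical, made up purely for illustration:

```python
import re

def swap_terms(text, replacements):
    """Replace each whole-word term with its nonsense alternative (case-sensitive)."""
    # Try longer terms first so a multi-word term wins over a shorter one it contains.
    terms = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], text)

# Hypothetical mapping: the replacement words are invented for this example.
replacements = {
    "Expelliarmus": "Flimbrax",
    "Lumos": "Quorzle",
}

sample = "He shouted Expelliarmus, then whispered Lumos."
swapped = swap_terms(sample, replacements)
print(swapped)
```

If the model then lists the nonsense words, it is probably reading the text it was given; if it lists the original names, it is probably falling back on what it memorized.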
That might actually boost performance since attention pays attention to stuff that stands out. If I make a typo, the models often hyperfixate on it.
When I tried it without web search, so with only internal knowledge, it missed ~15 spells.
Exactly. There was this study where they tried to make an LLM reproduce an HP book word for word, e.g. by giving it the first sentences and letting it cook.

Basically, with some tricks they managed to get 99% word for word. The tricks were needed to bypass the safeguards that are in place for exactly that reason: to stop people from retrieving training material.
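For a rough sense of what a "word for word" percentage could mean, here is a minimal sketch using Python's standard `difflib`. It is an assumption for illustration, not the metric the study used, and `word_match_fraction` is a hypothetical helper name:

```python
from difflib import SequenceMatcher

def word_match_fraction(original, generated):
    """Fraction of the original's words covered by in-order matching blocks."""
    a = original.split()
    b = generated.split()
    if not a:
        return 0.0
    # autojunk=False keeps frequent words from being ignored on long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)

# Toy strings, not book text: one word out of five differs.
score = word_match_fraction("the quick brown fox jumps", "the quick red fox jumps")
print(f"{score:.0%}")
```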

This reminds me of https://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Q... :

> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.

Do you remember how to get around those tricks?
This is the paper: https://arxiv.org/abs/2601.02671

Grok and Deepmind IIRC didn’t require tricks.

This really makes me want to try something similar with content from my own website.

I shut it down a while ago because bot traffic overtook the human traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue). However, the AI scrapers really started ramping up, and the only way I could realistically have continued would have been to pay a lot more for hosting/infrastructure.

I had put a ton of time into building out content, thousands of hours, only to have scrapers ignore robots.txt, bypass Cloudflare (which didn't have any AI products at the time), and overwhelm my measly infrastructure.

Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.

I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.

The new Cloudflare products for blocking bots and AI scrapers might be worth a shot, given how much work you put into the content.
What was your prompt?