I have website small enough not to attract too many bot, but sometime, something very innocent can bring my website down.
For example, I put a php ical viewer.. and some crawler start loading the calendar page, taking up all the cpu cycle.
Mental note, make sure my robots.txt files contain a few references to slowly returning pages full of almost nonsense that link back to each other endlessly…
Not complete nonsense, that would be reasonably easy to detect and ignore. Perhaps repeats of your other content with every 5th word swapped with a random one from elsewhere in the content, every 4th word randomly misspelt, every seventh word reversed, every seventh sentence reversed, add a random sprinkling of famous names (Sir John Major, Arc de Triomphe, Sarah Jane Smith, Viltvodle VI) that make little sense in context, etc. Not enough change that automatic crap detection sees it as an obvious trap, but more than enough that ingesting data from your site into any model has enough detrimental effect to token weightings to at least undo any beneficial effect it might have had otherwise.
And when setting traps like this, make sure the response is slow enough that it won't use much bandwidth, and the serving process is very lightweight, and just in case that isn't enough make sure it aborts and errors out if any load metric goes above a given level.
A bit like this? ( iocaine is newer)
Is iocaine actually newer though? Its first commit dates to 2025-01, while the blog post is from 2025-03. I couldn't find info on when Cloudflare started theirs. There's also Nepenthes, which had its first release in 2025-01 too.