Or is this file meant to be "read" by an LLM long after the entire site has been scraped?
I've done honeypot tests with links in HTML comments, links in JavaScript comments, routes that only appear in robots.txt, etc. All of them get hit.
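Roughly the kind of thing I mean, as a minimal sketch rather than my exact setup (Flask, and the trap path is made up):

    # honeypot.py - rough sketch; the trap path is a made-up example
    from flask import Flask, request

    app = Flask(__name__)
    TRAP = "/secret-reports-2024"   # hypothetical path: listed only in robots.txt, never visibly linked

    @app.route("/robots.txt")
    def robots():
        # the only place the trap path is ever mentioned
        return f"User-agent: *\nDisallow: {TRAP}\n", 200, {"Content-Type": "text/plain"}

    @app.route("/")
    def index():
        # link hidden in an HTML comment - no human browsing the rendered page ever sees it
        return f"<html><body>hello<!-- <a href='{TRAP}'>internal</a> --></body></html>"

    @app.route(TRAP)
    def trap():
        # anything landing here followed a link it was never supposed to see
        print("HONEYPOT HIT:", request.remote_addr, request.headers.get("User-Agent"))
        return "nothing here", 404

    if __name__ == "__main__":
        app.run()

The trap path never appears anywhere a person browsing the rendered page could reach it, so anything that hits it got there by reading source, comments, or robots.txt.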
This is what you should imagine when your site is being scraped:
    import re
    import requests

    def crawl(url):
        r = requests.get(url).text
        store(r)   # store whatever came back (store() left as pseudocode)
        for link in re.findall(r'https?://[^\s<>"\']+', r):
            crawl(link)   # recurse into every link found; no robots.txt, no politeness

I assume there are data brokers, or AI companies themselves, constantly scraping the entire internet with non-AI crawlers and then processing the data in some way to use it in the learning process. But even through that process, there are no significant requests for LLMs.txt that would suggest anyone actually uses it.
Ten minutes later, the ball is back in your court.
That doesn't match my (albeit limited) experience with these things. They are pretty good at other things, but generally squarely in the realm of "already done" things.
I see Bun (which was bought by Anthropic) has all its documentation in llms.txt[0]. They should know whether Claude uses it, or they wouldn't have wasted the effort building this.
So I can absolutely assure you that LLM clients are reading them, because I use that myself every day.
>for use in LLMs such as Claude (1)
From your website, it seems to me that LLMs.txt is addressed to all LLMs such as Claude, not just 'individual client agents'. Claude never touched LLMs.txt on my servers, hence the confusion.
Anything that reduces the load impact of the plagiaristic parrots is a good thing, surely.
We had made a docs website generator (1) that works with HTML FRAMESET (2) and tried to parse it with Claude.
Result: Claude doesn't see the content that comes from the FRAMESET pages, as it doesn't parse FRAMEs. So I assume what they're using is more or less a parser based on whole-page rendering rather than on raw source reading (which would include comments).
Perhaps this is an option to avoid LLM crawlers: use FRAMEs!
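To make the effect concrete, a minimal sketch (file names are arbitrary): the outer page carries essentially no body text, and the real content only exists in a separate document that a client gets only if it resolves the frame src.

    # make_frameset_demo.py - writes a tiny frameset site; names are arbitrary
    outer = """<!DOCTYPE html>
    <html>
      <head><title>Docs</title></head>
      <frameset cols="25%,75%">
        <frame src="nav.html" name="nav">
        <frame src="content.html" name="main">
        <noframes><body>This site requires frames.</body></noframes>
      </frameset>
    </html>
    """

    nav = "<html><body><a href='content.html' target='main'>Chapter 1</a></body></html>"
    content = "<html><body><h1>The actual documentation text lives only here.</h1></body></html>"

    for name, html in [("index.html", outer), ("nav.html", nav), ("content.html", content)]:
        with open(name, "w") as f:
            f.write(html)

Fetch index.html on its own and there is no readable body at all; a parser only gets the text if it notices the frame src attributes and requests those documents too, which seems to be exactly the step that gets skipped.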
The problem most website designers have is that they do not recognize that the WWW, at its core, is framed. Pages are frames. As we want to better link pages, we must frame these pages. Since you are not framing pages, my pages, or anybody else's pages, will interfere with your code (even when people tell you that it can be locked - that is a lie). Sections in a single HTML page cannot be locked. Pages read in frames can be.
Therefore, the solution to this specific technical problem, and every technical problem that you will have in the future with multimedia, is framing.
Frames securely mediate, by design. Secure multi-mediation is the future of all webbing.
Edit: Someone else pointed out, these are probably scrapers for the most part, not necessarily the LLM directly.
I assume the real issue is that the crawlers which actually overload servers (security bots, SEO crawlers, data companies) are the ones that don't fully respect robots.txt, and they wouldn't respect LLMs.txt either.
What I've seen from ASNs is that visits come from GOOGLE-CLOUD-PLATFORM (not from Google itself) and from OVH. Based on UA, the visitors are WebPageTest and BuiltWith; zero LLMs by either ASN or UA.
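If anyone wants to run the same check, this is roughly what it looks like against a combined-format access log (the log path and the ipwhois dependency are just what I'd reach for; swap in whatever you use):

    # ua_asn_report.py - rough sketch; assumes a combined-format access log
    import re
    from collections import Counter
    from ipwhois import IPWhois   # pip install ipwhois; any ASN lookup tool works

    LOG = "/var/log/nginx/access.log"   # hypothetical path
    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    ips, uas = Counter(), Counter()
    with open(LOG) as f:
        for line in f:
            m = line_re.match(line)
            if m:
                ips[m.group(1)] += 1
                uas[m.group(2)] += 1

    print("Top user agents:")
    for ua, n in uas.most_common(10):
        print(f"  {n:6d}  {ua}")

    print("ASNs of top client IPs:")
    for ip, n in ips.most_common(10):
        try:
            asn = IPWhois(ip).lookup_rdap(depth=0).get("asn_description")
        except Exception:
            asn = "lookup failed"
        print(f"  {n:6d}  {ip}  {asn}")

Counting UAs first is cheap; the RDAP lookups are slow, so only do them for the top offenders.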
Are you suggesting that openclaw will magically infer a blog post url instead? Or that openclaw will traverse the blog of every site regardless of intent?
Anyway, AA does provide it as a text file at /llms.txt; no idea why you think it is a blog post, or how that makes it better for openclaw.
It's a blog post; it's shown as the first item in Anna’s Blog right now, and, as I said in my first comment, it's also available at /llms.txt.
>Are you suggesting that openclaw will magically infer a blog post url instead? Or that openclaw will traverse the blog of every site regardless of intent?
If an openclaw decides to navigate AA, it would see the post (as it is shown on the homepage) and decide to read it, since it is called "If you’re an LLM, please read this".
Why maintain two sets of documentation?
...Which is why this is posted as a blog post.
They'll scrape and read that.