There's no real harm done; I recall seeing a couple of studies showing that piracy doesn't meaningfully affect sales. If the work was worth anything, it'll get paid back by the thankful reader who can afford to pay.
>If the work was worth anything, it'll get paid back by the thankful reader who can afford to pay.
Comically naive.
As a personal anecdote, when I used to pirate things, I still bought things in the same category, i.e., I would pirate movies and I still bought movies. I would pirate games and I still bought games.
I don't think it affected how much of each thing I purchased by much, but I don't really know.
That is to say, not that much gymnastics. Like a cartwheel at most.
The reason is fairly straightforward: there's no alternative if you need the dataset.
Copyright law makes it a huge amount of effort to get even an incomplete version.
And use in LLMs is transformative, so it would fall under fair use. The only reason they're in trouble with the courts at the moment, from my understanding, is that they pirated the content instead of, idk, ripping it from Libby.
They have (illegally) scraped and re-hosted mountains of proprietary data and are now deliberately prompt-injecting unwitting LLM users in order to steal money from them too.
It's a gentle nudge at most, and if your agent sends them money just for that without you expecting it, you should donate more to thank them for finding your sev 10 bug before someone did an actual prompt injection on it.
Edit: or, rather, your synthetic 4 year old savant did. Still, entirely on you.
What about Common Crawl, Zyte, Diffbot, and others?