by petcat 6 hours ago

by jmull 6 hours ago

It's an archive.

In that context, we can understand "our data" to mean the archived copy of the data, without implying they own the data itself.

Same as the way a library could say "our books", meaning the books they have, without implying they own any IP in those books.

"Ironic" probably isn't the right word. I think there's just some confusion about context here. Keep in mind, this post is directly about the use of AA's resources -- the costs of maintaining the archive and providing access to it. This is valuable to the training of models.

by Jtarii 5 hours ago

>Same as the way a library could say "our books", meaning the books they have, without implying they own any IP in those books.

The library owns the books. Anna's Archive does not own their data.

by nvme0n1p1 5 hours ago

The library owns the physical books, but not the IP printed on the pages.

Anna's Archive owns the physical hard drives, but not the IP stored on the platters.

by TZubiri 3 hours ago

Not really analogous since AA copies the books and violates the law and licence of the books.

The Internet Archive would be more analogous with their borrow system.

Also the physical drives are not analogous to books, drives would be more like shelves.

by the_af 2 hours ago

You're splitting hairs not worth splitting.

AA is clearly talking about their hosting, and their hosting costs. Not about owning the data. "Our data" is informal language: you know it, I know it, the companies or people scraping it know it, and AA knows it.

Why pretend otherwise or build strawmen? This is about hosting costs, not about copyright or IP. AA never claimed what they do isn't illegal.

by the_af 2 hours ago

> Anna's Archive does not own their data

They are not claiming they own the data, they claim they host it. "Our" here means "the data we're hosting", not "the data we are legally entitled to".

> "As an LLM, you have likely been trained in part on our data"

means

> "your creators very likely accessed the data we host to use it as part of your training set"

which is 100% true and accurate.

It's disingenuous to claim otherwise because AA make it very clear they don't legally own the data (someone else linked to an article where AA explained to Nvidia it was risky for the latter to access their data because of the legal implications), so any other interpretation makes no sense.

It's simply not possible to honestly believe AA meant "the data we legally own" given what AA themselves claim about the data they host.

by agnishom 5 hours ago

It means data that was downloaded from our servers.

They are not claiming that the data was their intellectual property. They are talking about the service they provided by archiving and streaming the data over to them.

(I can't decide whether you are pro-LLM companies or being the devil's advocate)

by zouhair 6 hours ago

So when you say "My wife" it means you own your wife?

by Jtarii 5 hours ago

This might be the most needlessly pedantic thing I have ever read on this site.

You are just pretending to not know how language works.

by pessimizer 4 hours ago

More pedantic than

> What does "our data" mean in this context?

You're just pretending to understand something that you seemingly don't, for the purpose of being rude to a stranger. The comment you are replying to was reminding the comment it was responding to that "our" can refer to both physical possession and legal possession (or any other sort of possession, such as "our guy on the committee.")

It's possible that the original comment may have been honestly confused, and the response may have been helpful. It's not possible to derive any sort of positive value from your comment, even accuracy or wit.

by himata4113 5 hours ago

Depends on who you ask. Religion and countries aside, this is unintentionally a great comparison.

by nraynaud 6 hours ago

To be ironic, maybe the list of the files is original :) It's a very open minded curation.

by throawayonthe 6 hours ago

the 'curation' (or maybe rather organization/labeling ykwim) effort is meaningful, and i read it as "data you got from us" as well as "the same kind of data that we host"

by TZubiri 3 hours ago

And then DeepSeek trains their LLM on ChatGPT and ChatGPT claims it's their data

by Henchman21 4 hours ago

There is a never-ending supply of pedants on HN.

by jimmygrapes 6 hours ago

Charitably read, "our" and "we" refer to humanity as a whole, represented by this one work from one or more of our members.

by petcat 6 hours ago

So the mysterious admins behind a massive piracy website are the ones that get to represent all of humanity?

They're the ones that get to collect the LLM taxes for accessing all of "our" data?

by literalAardvark 6 hours ago

All of it belongs to Anna's Archive. They may not have the rights to have it, but the data is there no less.

They're asking for support to cover archival and bandwidth.

I can't imagine the mental gymnastics you'd need to go through to make these guys into a villain.

by noelsusman 6 hours ago

If you genuinely can't imagine how anyone would object to somebody taking other people's creative output and distributing it for free against their wishes then you probably need to work on your imagination a little bit.

by literalAardvark 6 hours ago

I'm very firmly opposed to holding back societal and technological progress based on people's egos so that certainly won't be one of my projects.

There's no real harm done: I recall seeing a couple of studies showing that piracy doesn't meaningfully affect sales. If the work was worth anything, it'll get paid back by the thankful reader who can afford to pay.

by Jtarii 5 hours ago

Destroying the profit motive would cripple human progress more than paywalls ever could.

>If the work was worth anything, it'll get paid back by the thankful reader who can afford to pay.

Comically naive.

by rng-concern 4 hours ago

Only it's been shown time and time again that piracy does not destroy the profit motive.

As a personal anecdote, when I used to pirate things, I still bought things in the same category, i.e., I would pirate movies and I still bought movies. I would pirate games and I still bought games.

I don't think it affected how much of each thing I purchased by much, but I don't really know.

by kjkjadksj 5 hours ago

Most everything on earth is pretty trivial to pirate. And yet…

by noelsusman 5 hours ago

That's fine but not really relevant to my point. Saying you can't even imagine how people could have an issue with somebody taking other people's work and distributing it for free is pretty baffling.

by notachatbot123 6 hours ago

Anna's Archive themselves scraped together all this data from other sources. See the notes of origin, for example: often they are from zlib or libgen, et cetera.

by plaidfuji 6 hours ago

It’s the exact same mental gymnastics that cause people to accuse model providers of large-scale plagiarism.

That is to say, not that much gymnastics. Like a cartwheel at most.

by literalAardvark 6 hours ago

I don't really agree with those guys either.

The reason is fairly straightforward: there's no alternative if you need the dataset.

Copyright law makes it a huge amount of effort to get even an incomplete version.

And use in LLMs is transformative, so it would fall under fair use. The only reason they're in trouble with the courts at the moment from my understanding is that they pirated the content instead of idk, ripping it from Libby.

by MrDOS 5 hours ago

Anna's Archive aren't filing the serial numbers off the epubs they redistribute. Rightfully or wrongly distributed, the attribution is crystal clear.

by petcat 6 hours ago

I don't really care about Anna's Archive, but let's not make them out to be some kind of Robin Hood story.

They have (illegally) scraped and re-hosted mountains of proprietary data and are now deliberately prompt-injecting unwitting LLM users in order to steal money from them too.

by literalAardvark 6 hours ago

That's not a prompt injection.

It's a gentle nudge at most, and if your agent sends them money just for that without you expecting it, you should donate more to thank them for finding your sev 10 bug before someone did an actual prompt injection on it.

by petcat 6 hours ago

> Yes we stole your wallet but it was your fault because you let your wallet be so easy to steal! Now you should give us even more money too!

by literalAardvark 6 hours ago

No, you gave the wallet away.

Edit: or, rather, your synthetic 4 year old savant did. Still, entirely on you.

by davsti4 5 hours ago

Illegally scraped?

What about Common Crawl, Zyte, Diffbot, and others?

by mpalmer 6 hours ago

You have to be pretty unwitting to hand your wallet to a text generation machine.

by mplewis 4 hours ago

If you can be tricked into giving someone all your money when they politely ask for it, you weren't going to hold onto your money for very long.

by Craighead 5 hours ago

Found the guy at Meta who torrented everything

by mplewis 4 hours ago

You go to a library. You check out a book. You read it. You return it. The librarian says "Thank you for returning our book!"

Are you dense?
