by jaen 5 hours ago

by Ajedi32 4 hours ago

No, public data is not generally written by "experts expecting compensation".

By the way, I don't expect you to pay me for this comment. You can just read it for free. You're welcome.

by jaen 3 hours ago

Ugh, please don't read strawmen into others' arguments, and try to follow the HN guidelines.

Also, how about making proper arguments yourself? The vast majority of the training data isn't generated by company-paid AI experts either.

Notably, books, even though they don't form a large part of the training data, significantly improve performance on some tasks (same way as expert-generated data).

Why do you think the AI labs are so eager about scanning (and then destroying) every book on the planet?

If you removed all copyrighted works from the training corpus, the model would be notably weaker.

by calgoo 4 hours ago

No, but people do upload data with an expectation that it not be used without their permission (unless they choose a BSD/MIT/public-domain-like license). Otherwise, the platform and/or the user do expect the data NOT to be used for purposes other than what it was intended for. Your comment is still your comment, and the Hacker News platform also has a say in this. If there had been an opt-in, then fine, no problem, but there was none; they just trained on everything available, including downloading pirated books from the internet.

by Ajedi32 4 hours ago

I think it's unreasonable to post anything on a public forum and then expect to be able to control who reads it and for what purpose.

by calgoo 4 hours ago

Answering here as it won't let me reply: Just because something is public does not mean you can do whatever you want with it. You can't just copy an article from a news site and paste it on yours; that's theft. If you don't agree, fine, but that is the law, and ALL the mega corps have been fighting to keep it this way for the last 20 years. If they want to steal everyone's info, fine, then lobby to change the copyright laws and no problem.

by sneak 3 hours ago

Copying isn’t theft, we settled this in the 90s.

by pastel8739 4 hours ago

Books?

by Ajedi32 4 hours ago

The vast, vast majority of AI training data is not books. I wouldn't be surprised if there's more text on HN alone than in every book in the history of mankind (most of which are also no longer copyrighted).

by rafram 5 hours ago

I didn't say that.

by thom 5 hours ago

No, you just parroted an increasingly popular talking point, the entire purpose of which seems to be to absolve AI companies of the enormous theft that put them in the position to hire experts in the first place.

by rafram 5 hours ago

Well, I'd never heard anyone make it before, but sure. (I looked into Mercor a bit and know some people who've worked in data generation/labeling, which is what exposed me to that side of the operation.)

It doesn't absolve them of any theft, but it does make the assertion that they should be required to release their models to the public seem, to me, a bit farcical. There are dozens of free and open-weights models that have all trained on exactly the same web crawls and books as GPT-5 and Opus. The proprietary models are better because of proprietary data.

by franga2000 5 hours ago

Cool, then they can train their proprietary models on their proprietary data only.

Even if the other models were trained on the same data (which is unlikely, since they had less time and money to scrape it and fewer lawyers to let them get away with something like piracy), the proprietary models are still largely built on the public data and wouldn't exist without it. At the very least, they should release the intermediate model, before training on their proprietary data. Not that that's how that works...

by thom 5 hours ago

I agree that saying that they have now trained on lots of proprietary data allows them to muddy the legislative waters further than they already have. What a happy coincidence!

by noitemtoshow 4 hours ago

I’d suggest you learn more about how LLM training works. Training on internet data alone will not result in an agent answering your questions.

by thom 3 hours ago

Sure as shit won't answer them without that though.

by mbesto 4 hours ago

> The proprietary models are better because of proprietary data

Source? Otherwise this is pure speculation.

by jaen 5 hours ago

Indeed, that's exactly why I replied - you omitted one side from the discussion.