undefined

upvote

points

by calny10 hours ago |

upvote

by DrammBA7 hours ago|

[-]

Also the maintainer's ground-up rewrite argument is very flimsy when they used chardet's test-data and freely admit to:

> I've been the primary maintainer and contributor to this project for >12 years

> I have had extensive exposure to the original codebase: I've been maintaining it for over a decade. A traditional clean-room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here.

> I reviewed, tested, and iterated on every piece of the result using Claude.

> I was deeply involved in designing, reviewing, and iterating on every aspect of it.

reply

upvote

by Lerc8 hours ago|

[-]

There was a paper that proposed a content based hashing mask for traning

The idea is you have some window size, maybe 32 tokens. Hash it into a seed for a pseudo random number generator. Generate random numbers in the range 0..1 for each token in the window. Compare this number against a threshold. Don't count the loss for any tokens with a rng value higher than the threshold.

It learns well enough because you get the gist of reading the meaning of something when the occasional word is missing, especially if you are learning the same thing expressed many ways.

It can't learn verbatim however. Anything that it fills in will be semantically similar, but different enough to get cause any direct quoting onto another path after just a few words.

reply

upvote

by fc417fc8024 hours ago|

[-]

> you get the gist of reading the meaning of something when the occasional word is missing,

I think it's more subtle than that. IIUC the tokens were all present for the purpose of computing the output and the score is based on the output. It's only the weight update where some of the tokens get ignored. So the learning is lossy but the inference driving the learning is not.

Rather than a book that's missing words it's more like a person with a minor learning disability that prevents him from recalling anything perfectly.

However it occurs to me that data augmentation could easily break the scheme if care isn't taken.

reply

upvote

by calny6 hours ago|

[-]

Thanks! Appreciate the response and will look into this

reply

upvote

by layer87 hours ago|

[-]

> Is anyone working on this?

There was recently https://news.ycombinator.com/item?id=47131225.

reply

upvote

by calny6 hours ago|

[-]

Thanks! I missed that. The attribution by training data source category (arxiv vs wikipedia vs nemotron etc.) is an interesting approach.

reply

upvote

by oofbey9 hours ago|

[-]

The difference in indemnification based on which plan you’re on is super important. Thanks for pointing that out - never would have thought to look.

reply

upvote

by amelius8 hours ago|

[-]

Is this clause even legally valid?

How can the user know if the LLM produces anything that violates copyright?

(Of course they shouldn't have trained it on infringing content in the first place, and perhaps used a different model for enterprise, etc.)

reply

upvote

by galaxyLogic8 hours ago|

[-]

"... If AI-generated code cannot be copyrighted (as the courts suggest) ".

So, Supreme Court has said that. AI-produced code can not be copyrighted. (Am I right?). Then who's to blame if AI produces code large portions of which already exist coded and copyrigted by humans (or corporations).

I assume it goes something like this:

A) If you distribute code produced by AI, YOU cannot claim copyright to it.

B) If you distribute code produced by AI, YOU CAN be held liable for distributing it.

reply

upvote

by jcranmer6 hours ago|

[-]

SCOTUS hasn't ruled on any AI copyright cases yet. But they've said in Feist v Rural (1991) that copyright requires a minimum creative spark. The US Copyright Office maintains that human authorship is required for copyright, and the 9th Circuit in 2019 explicitly agreed with the law that a non-human animal cannot hold any copyright.

Functionally speaking, AI is viewed as any machine tool. Using, say, Photoshop to draw an image doesn't make that image lose copyright, but nor does it imbue the resulting image with copyright. It's the creativity of the human use of the tool (or lack thereof) that creates copyright.

Whether or not AI-generated output a) infringes the copyright of its training data and b) if so, if it is fair use is not yet settled. There are several pending cases asking this question, and I don't think any of them have reached the appeals court stage yet, much less SCOTUS. But to be honest, there's a lot of evidence of LLMs being able to regurgitate training inputs verbatim that they're capable of infringing copyright (and a few cases have already found infringement in such scenarios), and given the 2023 Warhol decision, arguing that they're fair use is a very steep claim indeed.

reply

upvote

by larodi6 hours ago|

[-]

The lack thereof (of human use). Prompts are not copyrightable thus the output also - not. Besides retelling a story is fair use, right? Otherwise we should ban all generative AI and prepare for Dune/Foundation future. But we not there, and we perhaps never going to be.

So the LLM training first needs to be settled, then we talk whether retelling a whole software package infringes anyone's right. And even if it does, there are no laws in place to chase it.

reply

upvote

by fc417fc8023 hours ago|

[-]

> Prompts are not copyrightable

Surely that varies on a case by case basis? With agentic coding the instructions fed in are often incredibly detailed.

reply

upvote

by galaxyLogic10 minutes ago|

[-]

In practice the output of the LLM does not tell what the prompt was, and the output varies randomly, so it is unlikely you would be sued for copying the prompt. And in fact you would not know what the prompt, if any, was for the original unless you copied the prompt from somewhere.

reply

upvote

by jcranmer5 hours ago|

[-]

> Besides retelling a story is fair use, right?

Actually, most of the time, it is not.

reply

upvote

by tzs6 hours ago|

[-]

The Supreme Court has not ruled on this issue. An appeal of a lower court's ruling on this issue was appealed to the Supreme Court but the Supreme Court declined to accept the case.

The Supreme Court has "original jurisdiction" over some types of cases, which means if someone brings such a case to them they have to accept it and rule on it, and they have "discretionary jurisdiction" over many more types of cases, which means if someone brings one of those they can choose whether or not they have to accept it. AI copyright cases are discretionary jurisdiction cases.

You generally cannot reliable infer what the Supreme Court thinks of the merits of the case when they decline to accept it, because they are often thinking big picture and longer term.

They might think a particular ruling is needed, but the particular case being appealed is not a good case to make that ruling on. They tend to want cases where the important issue is not tangled up in many other things, and where multiple lower appeals courts have hashed out the arguments pro and con.

When the Supreme Court declines the result is that the law in each part of the country where an appeals court has ruled on the issue is whatever that appeals court ruled. In parts of the country where no appeals court has ruled, it will be decided when an appeal reaches their appeals courts.

If appeals courts in different areas go in different directions, the Supreme Court will then be much more likely to accept an appeal from one of those in order to make the law uniform.

reply

upvote

by 0x4573 hours ago|

[-]

But this means code generated by snippet expanders or any sort of templates is non-copyrightable.

reply

upvote

by throwup2387 hours ago|

[-]

IANAL but I was under the impression that Supreme Court ruling was very specific to the AI itself copyrighting its own produced code. Once a human is involved, it gets a lot more complicated and rests on whether the human's contribution was substantial enough to make it copyrightable under their person.

reply

upvote

by galaxyLogic2 minutes ago|

[-]

A fun exercise: When Supreme Court has not ruled on any open legal question of interest, let's ask AI what would be a likely ruling by Supreme Court.

I think SCOTUS might in fact use AI to get a set of possible interpretations of the law, before they come up with their decision.

reply

upvote

by 7 hours ago|

[-]

deleted

reply

upvote

by popalchemist4 hours ago|

[-]

Copyright does not cover ideas. Only specific executions of ideas. So unless it's a line-by-line copy (unlikely) there is no recourse for someone to sue for a re-execution/reimplementation of an idea.

reply

upvote

by aeon_ai8 hours ago|

[-]

You've likely paid attention to the litigation here. Regardless of what remains to be litigated, the training in and of itself has already been deemed fair use (and transformative) by Alsup.

Further, you know that ideas are not protected by copyright. The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.

If it were the case that the LLM ingested the code and regurgitated it (as would be the premise of highlighting the training data provenance), that similarity would be much higher. That is not the case.

reply

upvote

by calny7 hours ago|

[-]

You're right, I've followed the litigation closely. I've advocated for years that "training is fair use" and I'm generally an anti-IP hawk who DEFENDS copyright/trademark cases. Only recently have I started to concede the issue might have more nuance than "all training is fair use, hard stop." And I still think Judge Alsup got it right.

That said, even if model training is fair use, model output can still be infringing. There would be a strong case, for example, if the end user guides the LLM to create works in a way that copies another work or mimics an author or artist's style. This case clearly isn't that. On the similarity at issue here, I haven't personally compared. I hope you're right.

reply

upvote

by overfeed4 hours ago|

[-]

> The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.

Can I use one AI agent to write detailed tests based on disassembled Windows, and another to write code that passes those same function-level tests? If so, I'm about to relicense Windows 11 - eat my shorts, ReactOS!

reply