I could not agree more with EFF.

There’s a difference between training a model and using a model. Training involves copyrighted works, but fair use is not just about whether copyrighted works are used; it’s about whether the use is transformative and whether it substitutes for the original in its market. I struggle to see how training is not transformative under these criteria.

The use of the model (being able to output copies of GPL software) is a different question. This depends on the circumstances: if GPL code is exactly reproduced then it very well could be subject to the license of the original work.

I don’t understand the legal objections to the fair use of protected IP. Licenses are legal documents, not moral imperatives. GPL only exists because of copyright law, and you can’t write a license that supersedes copyright law if you don’t like the law.

The Claude Code example is completely different: hosting a repo with the leaked code is clearly not fair use.

reply
Is it actually possible to determine how much the weights were influenced by each work?

I vaguely recall reading an interpretability paper years ago that trained a special model that could attribute each answer to a part of the corpus (like Wikipedia, arXiv, or "Blogs"), but it had a non-zero effect on performance and wasn't nearly as straightforward as "weights go in, attribution comes out."

reply
It’s very possible to determine similar works that existed earlier, and from that, recover attribution.

The “downside” is you may attribute similar works that weren’t inspirations, but coincidental. But I think that’s an upside: when someone discovers something novel and great but their work fails because of bad luck or non-novel details, then the discovery is finally recognized in another work, I think they should still be attributed.
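To make "determine similar works that existed earlier" concrete, here is a minimal sketch of what I mean. It is only one possible approach (word-trigram overlap scored with Jaccard similarity); the function names, the threshold, and the sample texts are all made up for illustration, and nothing here looks at model weights.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def attribute(output, earlier_works, n=3, threshold=0.2):
    """Rank earlier works by n-gram overlap with output.

    earlier_works maps a (hypothetical) title to its text. Works scoring
    below the threshold are dropped. A match only shows similarity, so a
    coincidentally similar work is attributed too.
    """
    out = shingles(output, n)
    scored = [(title, jaccard(out, shingles(text, n)))
              for title, text in earlier_works.items()]
    kept = [(title, score) for title, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up example data.
earlier_works = {
    "work_a": "the quick brown fox jumps over the lazy dog",
    "work_b": "an entirely unrelated sentence about sales tax",
}
matches = attribute("the quick brown fox jumps over a sleepy dog", earlier_works)
```

Here only `work_a` clears the threshold. A real system would need something far more robust than trigram overlap, but the coincidental-match behavior I described is visible even in this toy version.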

reply
>Is it actually possible to determine how much the weights were influenced by each work?

It will be very possible once they become the owners of the intellectual property being infringed. Think about how it was "impossible" to implement DRM on music and movies in the early days of YouTube. Now, Google owns the content and platform, and suddenly their "rolling cipher" which involves no encryption at all is supposedly enforceable DRM.

The Silicon Valley tech bros play the same game every time. They violate the law, say it's just too darn difficult to obey the law without stifling progress, and then they get away with it until they kill all the competition. At which point, the law is once again applicable to anyone that might try to challenge them.

Remember how Amazon destroyed all the other retailers when they had a decade of no sales tax while brick & mortar had to obey it. "Calculating sales tax for 50 different states?! That's impossible!!!" What a load of shit...

Now, knowing that they're going to do this playbook again, how do you think it's going to play out? We've already seen it. Anthropic steals your copyrighted code, puts together their Claude Code project, the code for that project leaks, but now THEY own it! They sent DMCA takedowns on that AI generated code. AI generated code enjoys no copyright protection, it cannot be DMCAed under the law, there's no copyright on it. But Anthropic claims there is, and GitHub will obey the takedown, and nobody has the money to step up and stop them.

See where this is going? Once they achieve market dominance, they will claim that all the code generated by Claude belongs to Anthropic, your prompts belong to you, but THEIR machine generated THEIR code and you only purchase a license to it with your tokens. A limited license. It might be revocable, it might expire, maybe you need to pay an annual fee to keep using THEIR code Claude generated for you. And if you actually just write code on your own, without Claude? Well, prepare to be sued like a network printer is sued by the RIAA because that's going to happen too. They will have their robot scour your code for "fair use" training and discover that it's just too similar to something their machine generated a year earlier. Sorry open source programmer, here's your legalese nastygram. It appears you owe Anthropic some money.

reply
I do not defend the current state of things where a select few companies get to shamelessly violate the law with the entire legal framework bending around the weight of the money trapped in this speculative bubble.

I believe LLMs are at the very least an under-researched technology or, less charitably, an ongoing effort to strip intellectual workers of their rights and privileges.

What I am saying is the reasonable demand for attribution runs counter to the nature of these systems as we know them. There is no magical "release the attribution" button Anthropic could press if they wanted to. Unlike per-state taxes, this is a problem actual PhDs are working on, at universities and private labs, because transparency has been the number one public demand since day one, and yet all that exists after 4 years of funding are the first incomplete steps.

The most likely outcome of imposing this obligation is commercial LLM providers quickly folding, finding a loophole/displaying false attribution, or settling for notably worse performance. That is of course not counting how these companies will be on the hook for a civilizational amount of licensing fees.

(Per the DRM point, I believe we can agree that the goal of displaying a piece of media in the physical world while somehow preventing the viewer from storing it is, at least in its strict form, effectively impossible without hiring a trusted guard to hold the viewer at gunpoint if they dare touch the trusted viewing apparatus or pull out their phone.)

I am personally okay with shutting down an industry that cannot legally exist in its current form, especially one so openly hostile to every field of human endeavor. But no matter your position on that, we must keep in mind no "ethical" or "legal" AI industry can exist without making either adjective meaningless.

reply
> An LLM is a predict the next word algorithm.

This is what's known as a category error; an LLM is a 'model', not an algorithm.

It's not even an accurate claim; LLMs predict the next token, not the next word.
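To illustrate the token/word distinction, here is a toy sketch. The vocabulary and the greedy longest-match rule are invented for this example; production tokenizers (BPE and friends) learn their vocabularies from data and are considerably more involved.

```python
# Hypothetical toy vocabulary of subword pieces, invented for illustration.
VOCAB = {"kub", "ern", "etes", "cluster", " "}


def tokenize(text, vocab):
    """Greedy longest-match split of text into vocabulary pieces.

    Falls back to a single character when no known piece matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


text = "kubernetes cluster"
words = text.split()
tokens = tokenize(text, VOCAB)
```

Two words come out as five tokens here (`kub`, `ern`, `etes`, the space, `cluster`), and the model's prediction step operates on pieces like these rather than on whole words.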

> AI is essentially copy paste with more steps

What about when AI creates a limerick about a Kubernetes cluster run by Buddhist monks? Or any number of other novel creations?

Fortunately the courts recognized the transformative use involved in making a model, which is fair use of copyrighted works, in Kadrey v. Meta Platforms.

> The most infurating thing however is how AI companies sidestep the IP rights of authors

Transformative use falls under fair use; permission from authors is not needed to use legally acquired copyrighted works for training. See Kadrey v. Meta Platforms and Bartz v. Anthropic.

> but then claim to own those IP rights when their own generated output leaks.

Corporations gonna do corporate things. Blatant hypocrisy is par for the course. Organize and take them to court.

reply
> They can determine how much each work contributed based on those weights, so it's dishonest for them to argue it isn't possible.

I don’t know about impossible but it’s definitely not a straightforward read from the post-training weights as you’re implying, unless you’re aware of some technique I’m not aware of.

The closest you could get would be the weight differential from training with a given work. But that’s massively dependent on training order, so it’s certainly not a good measure of “contribution.”
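Here is a minimal sketch of the order dependence, assuming the simplest setup I could think of: one scalar weight, squared-error loss, plain SGD with one example per step. The two "works" A and B and all the numbers are made up; real training has billions of weights and batches that mix many works per step.

```python
def per_example_deltas(order, examples, lr=0.1, w=0.0):
    """Run one SGD pass in the given order and record each example's weight change.

    Model: prediction = w * x, loss = 0.5 * (w * x - y) ** 2.
    Returns (deltas, final_w), where deltas[name] is the update that
    example caused at the moment it was seen.
    """
    deltas = {}
    for name in order:
        x, y = examples[name]
        grad = (w * x - y) * x   # d(loss)/dw at the current weight
        step = -lr * grad
        deltas[name] = step
        w += step
    return deltas, w


# Two hypothetical "works", each reduced to a single (x, y) pair.
examples = {"A": (1.0, 2.0), "B": (2.0, 1.0)}

deltas_ab, w_ab = per_example_deltas(["A", "B"], examples)
deltas_ba, w_ba = per_example_deltas(["B", "A"], examples)
```

The same example A moves the weight by a different amount depending on whether it is seen first or second, and the final weights differ too. So even in this toy case the per-work weight differential reflects the training schedule as well as the work itself.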

reply
Agreed. Moreover, the authors of copyright law could never have anticipated this type and scale of abuse. Maybe the companies are legally in the right, maybe not, but that's irrelevant for the question of whether it's ethical. The EFF's post definitely goes against their mission to "ensure that technology supports freedom, justice, and innovation for all people of the world."
reply