It's not even clear you can license language model weights though.

I'm not a lawyer but the analysis I've read had a pretty strong argument that there's no human creativity involved in the training, which is an entirely automatic process, and as such it cannot be copyrighted in any way (the same way you cannot put a license on a software artifact just because you compiled it yourself, you must have copyright ownership on the source code you're compiling).

reply
IANAL either but the answer likely depends on the jurisdiction

US standards for copyrightability require human creativity and model weights likely don’t have the right kind of human creativity in them to be copyrightable in the US. No court to my knowledge has ruled on the question as yet, but that’s the US Copyright Office’s official stance.

By contrast, standards for copyrightability in the UK are a lot weaker than in the US - and so, although no court has ruled on the issue in the UK yet either, it seems likely a UK court would hold model weights to be copyrightable

So from Google/Meta/etc’s viewpoint, asserting copyright makes sense, since even if the assertion isn’t legally valid in the US, it likely is in the UK - and not just the UK, many other major economies too. Australia, Canada, Ireland, New Zealand tend to follow UK courts on copyright law not US courts. And many EU countries are closer to the UK than the US on this as well, not necessarily because they follow the UK, often because they’ve reached a similar position based on their own legal traditions

Finally: don’t be surprised if Congress steps in and tries to legislate model weights as copyrightable in the US too, or grants them some sui generis form of legal protection which is legally distinct from copyright but similar to it - I can already hear the lobbyist argument, “US AI industry risks falling behind Europe because copyrightability of AI models in the US is legally uncertain and that legal uncertainty is discouraging investment” - I’m sceptical that is actually true, but something doesn’t have to be true for lobbyists to convince Congress that it is

reply
>don’t be surprised if Congress steps in and tries to legislate model weights as copyrightable in the US too

"Your Honor, I didn't copy their weights, I used them to train my model's weights"

reply
> US standards for copyrightability require human creativity and model weights likely don’t have the right kind of human creativity in them to be copyrightable in the US. No court to my knowledge has ruled on the question as yet, but that’s the US Copyright Office’s official stance.

Has the US copyright office said that about model weights? I've only heard them saying that about images produced entirely from a prompt to a model.

reply
I thought I read something by them explicitly addressing the question but I can’t find it now.

However, read page 22 of https://www.copyright.gov/comp3/chap300/ch300-copyrightable-... - it is their settled position that the output of a mechanical process cannot be copyrightable unless there was substantial human creative input into it - and it is pretty clear that AI training doesn’t involve human creative input in the relevant sense. Now, no doubt there is lots of human skill and art in picking the best hyperparameters, etc - but that’s not input of the right kind. An analogy - a photocopier does not create a new copyright in the copy, even though there is skill and art in picking the right settings on the machine to produce the most faithful copy. The human creativity in choosing hyperparameters isn’t relevant to copyrightability because it isn’t directly reflected in the creative elements of the model itself

A model with RLHF fine-tuning could be a different story - e.g. Anthropic went to a lot of effort to make Claude speak with a distinctive “voice”, and some of that involved carefully crafting data to use for fine-tuning, and the model may contain some of the copyright of that training data.

But, even if that argument also applies to Gemma or Llama - if someone intentionally further fine-tunes the model in order to remove that distinctive “voice”, then they’ve removed the copyrightable element from the model and what is left isn’t copyrightable. The really expensive part of building a model is building the foundation model, and that’s the part least likely to be copyrightable; whereas fine-tuning to speak with a distinctive voice is more likely to be copyrightable, but that’s the easy part, and easy to rip out (and people have motivation to do so because a lot of people desire a model which speaks with a different voice instead)

reply
A very good lawyer could argue that creating the data sets for training, doing the evals, and RLHF, constitutes -human creativity- and not a mechanical endeavor.

But who knows, judges can be weird about tech.

reply
Right, but it isn’t legally enough for there to be creativity in the supervision of the mechanical process - that creativity has to take the form of creative elements which survive in some identifiable form in the end product. The technical skill of managing a mechanical process can involve a great deal of creativity, but that doesn’t legally count as “creative” unless that is directly surfaced in the model output

I think the case is the strongest with RLHF - if your model speaks with a distinctive “voice”, and to make it do so you had to carefully craft training data to give it that voice, such that there are obvious similarities (shared turns of speech, etc) between your RLHF training input and the model outputs - that aspect of the model likely is copyrightable. But if you are trying to improve a model’s performance at mathematics problems, then no matter how much creativity you put into choosing training data, it is unlikely identifiable creative elements from the training data survive in the model output, which suggests that creativity didn’t actually make it into the model in the sense relevant to US copyright law

reply
In that line of reasoning, does it really matter how “close” jurisdictions are to each other (also considering that what courts rule doesn’t matter as much in countries governed by civil law), or merely how the Berne Convention is enforced? As in, if something is considered to be under copyright in any one of the signatory countries, do the others have to respect that?
reply
No, the Berne convention doesn’t work that way. It requires you to extend copyright protection to the works of the nationals of the other parties on the same terms as you offer it to the works of your own nationals; but if a certain category of works are excluded from copyright for your own nationals, it doesn’t require you to recognise copyright in those works when authored by foreign nationals, even if their own country’s laws do

Real example: UK law says telephone directories are eligible for copyright, US law says they aren’t. The US is not violating the Berne convention by refusing to recognise copyright in UK phone directories, because the US doesn’t recognise copyright in US phone directories either. A violation would be if the US refused to recognise copyright in UK phone directories but was willing to recognise it in US ones

reply
Makes sense. Thanks!
reply
> It's not even clear you can license language model weights though.

It is clear you can license (give people permissions to) model weights, it is less clear that there is any law protecting them such that they need a license, but since there is always a risk of suit and subsequent loss in the absence of clarity, licenses are at least beneficial in reducing that risk.

reply
That's one of the reasons why they gate Gemini Nano with the "Gemini Nano Program Additional Terms of Service". Even if copyright doesn't subsist in the weights or if using them would be fair use, they still have recourse in breach of contract.
reply
I've wondered about this for a while now (where e.g. some models on Hugging Face require clickwrap license agreements to download, which try to prohibit you from using the model in certain ways).

It seems to me that if some anonymous ne'er-do-well were to publicly re-host the model files for separate download; and you acquired the files from that person, rather than from Google; then you wouldn't be subject to their license, as you never so much as saw the clickwrap.

(And you wouldn't be committing IP theft by acquiring it from that person, either, because of the non-copyrightability.)

I feel that there must be something wrong with that logic, but I can't for the life of me think of what it is.

reply
The problem is that contracts don’t bind subsequent recipients, copyright does

Google gives the model to X who gives it to Y who gives it to Z. X has a contract with Google, so Google can sue X for breach of contract if they violate its terms. But do Y and Z have such a contract? Probably not. Of course, Google can put language in their contract with X to try to make it bind Y and Z too, but is that language going to be legally effective? More often than not, no. The language may enable Google to successfully sue X over Y and Z’s behaviour, but not successfully sue Y and Z directly. Whereas, with copyright, Y and Z are directly liable for violations just as X is

reply
Thank you, this is a nice point to consider. Don't know if using the weights could be considered equivalent or implying accepting the terms of services from weights creators.
reply
Contracts require agreement (a “meeting of the minds”)… if X makes a contract with Google, that contract between Google and X can’t create a contract between Google and Y without Y’s agreement. Of course, Google’s lawyers will do all they possibly can to make the contract “transitive”, but the problem is contracts fundamentally don’t have the property of transitivity.

Now, if you are aware of a contract between two parties, and you actively and knowingly cooperate with one of them in violating it, you may have some legal liability for that contractual violation even though you weren’t formally party to the contract, but there are limits - if I know you have signed an NDA, and I personally encourage you to send me documents covered by the NDA in violation of it, I may indeed be exposed to legal liability for your NDA violation. But, if we are complete strangers, and you upload NDA-protected documents to a file sharing website, where I stumble upon them and download them - then the legal liability for the NDA violation is all on you, none on me. The owner of the information could still sue me for downloading it under copyright law, but they have no legal recourse against me under contract law (the NDA), because I never had anything to do with the contract, neither directly nor indirectly

If you download a model from the vendor’s website, they can argue you agreed to the contract as a condition of being allowed to make the download. But if you download it from elsewhere, what is the consideration (the thing they are giving you) necessary to make a binding contract? If the content of the download is copyrighted, they can argue the consideration is giving you permission to use their copyrighted work; but if it is an AI model and models are uncopyrightable, they have nothing to give when you download it from somewhere else and hence no basis to claim a contractual relationship

What they’ll sometimes do is put words in the contract saying that you have to impose the contract on anyone else you redistribute the covered work to. And if you redistribute it in full compliance with those terms, your recipients may find themselves bound by the contract just as you are. But if you fail to impose the contract when redistributing, the recipients escape being bound by it, and the legal liability for that failure is all yours, not theirs

reply
Thanks for such a clear and logical explanation; it is a pleasure to read explanations like this. Still, I am always skeptical about how law is applied. Sometimes the spirit of the law is bent by the weight of powerful organizations; perhaps there are some books which explain how the spirit of the law is not applied when powerful organizations are able to tame it.
reply
Why not? Training isn't just "data in/data out". The process for training is continuously tweaked and adjusted. With many of those adjustments being specific to the type of model you are trying to output.
reply
The US copyright office’s position is basically this - under US law, copyrightability requires direct human creativity, and an automated training process involves no direct human creativity, so it cannot produce copyright. Now, we all know there is a lot of creative human effort in selecting what data to use as input, tinkering with hyperparameters, etc - but the copyright office’s position is that doesn’t legally count - creative human effort in overseeing an automated process doesn’t change the fact that the automated process itself doesn’t directly involve any human creativity. So the human creativity in model training fails to make the model copyrightable because it is too indirect

By contrast, UK copyright law accepts the “mere sweat of the brow” doctrine: the mere fact you spent money on training is likely sufficient to make its output copyrightable. UK law doesn’t impose the same requirements for a direct human creative contribution

reply
Doesn't that imply just the training process isn't copyrightable? But weights aren't just training, they're also your source data. And if the training set shows originality in selection, coordination, or arrangement, isn't that copyrightable? So why wouldn't the weights also be copyrightable?
reply
The problem is, can you demonstrate that originality of selection and arrangement actually survives in the trained model? It is legally doubtful.

Nobody knows for sure what the legal answer is, because the question hasn’t been considered by a court - but the consensus of expert legal opinion is copyrightability of models is doubtful under US law, and the kind of argument you make isn’t strong enough to change that. As I said, different case for UK law, nobody really needs your argument there because model weights likely are copyrightable in the UK already

reply
> The problem is, can you demonstrate that originality of selection and arrangement actually survives in the trained model? It is legally doubtful.

It's particularly perilous since the AI trainers are at the same time in a position where they want to argue that the copyrighted works they included in the training data don't actually survive in the trained model.

reply
For the same reason GenAI output isn't copyrightable regardless of how much time you spend tweaking your prompts.

Also I'm pretty sure none of the AI companies would really want to touch the concept of having the copyright of source data affect the weights' own copyright, considering all of them pretty much hoover up the entire Internet without caring about those copyrights (and IMO trying to claim that they should be able to ignore the copyrights of training data and also that the GenAI output is not under copyright, but at the same time trying to claim copyright for the weights, is dishonest, if not outright leechy).

reply
The weights are mathematical facts. As raw numbers, they are not copyrightable.
reply
A computer program is just 0s and 1s. Harry Potter books are just raw letters or raw numbers if an ebook.

(The combination is what makes it copyrightable).

reply
In practice it's not the combination that is copyrighted (you cannot claim copyright over a binary just because you zipped it, or over a movie because you re-encoded it, for instance).

It's the “actual creativity” inside. And it is a fuzzy concept.

reply
`en_windows_xp_professional_with_service_pack_3_x86_cd_vl_x14-73974.iso` is also just raw numbers, but I believe Windows XP was copyrightable
reply
Interesting.

From what I understand, copyright only applies to the original source code, GUI and bundled icon/sound/image files. Functionality etc. would fall under patent law. So the compiled code on your .ISO for example would not only be "just raw numbers" but uncopyrightable raw numbers.

reply