I'm a strong proponent of Open Source (TM) but I disagree with this take.

The weights are the useful artifact here. You can modify them, fine-tune them, and do what you want with them.

Unlike with binary software, nothing limits that.

It is also useful to have access to the training recipes and to some extent the data. But I'm of the opinion that learning on something is not copyright infringement, so there are many circumstances where distributing the raw training data will not be possible.

For me this is like Open Office: it is open source, and largely inspired by and learned from Microsoft Office. But they don't need to distribute MS Office for Open Office to be Open Source.

In addition there are models that meet the criteria you appear to propose. The AllenAI models are a good example.

The analogy falls apart very quickly. Without the training data, your modifications amount to virtually nothing compared to what these "versions" are. The idea that you can maintain and improve these models without the continued support of the company that owns the training data, the harnesses, and the build instructions in general is not very credible. This is why it is not rare for companies to "dump" old versions as freeware, then at some point stop distributing them, and mostly get away with it. Because this is not really open, the threat of an effective fork is non-existent, and the pressure on anyone who has released freeware models to "go SaaS" is too high.

By contrast, if "Open Office" switches to a more problematic license at some point, the existing source has everything an organization needs to support the project without regard to the original company (this has happened already!). If Qwen decides to stop distributing models for download, you're basically stuck. _Even_ if you have unlimited resources, it's not clear how the released weights help you; your best bet is to start almost from scratch. This has also happened...

These models are not "Open" by any definition of the word. They are just freely redistributable. You can justify a cowboy approach to copyright however you want, but that doesn't change the fact that this is not open and has almost none of the benefits of open, so it is a huge abuse of the word "Open".

Ironically, about the only things that are copyrightable here are the sum of the training data (possibly) _AND_ the software used to build the model (most definitely). The model itself most likely isn't (databases are not copyrightable), which makes it even more pointless to abuse the word "open" for it. All the value is in the former two.

What would the 'source' be for an LLM? There is the structure and there are the weights; there is no 'source'.

In case you're not just trolling, please learn how "the weights", which are analogous to a compiled executable, are made.

There is no source because it's not software. You can of course modify and make your own.