Licenses for LLM models

lily33@lemm.ee · 2 years ago

Atemu · 2 years ago

IANAL. TINLA.

The machine producing the derivative work is a thing, not a person, which means it cannot hold a copyright on anything. If it somehow did anything original, that work would be in the public domain.

The weights of the model, however, would likely be considered a derivative work of the training data because they were created directly from it. Thus, the copyright of the weights belongs to whoever owns the copyright to the training data.

The training data is created from thousands/millions/billions of individually copyrighted works. This would constitute a derivative work too, but there’s an escape hatch: fair use. If the use of the original works is transformative enough, the creator of the derivative work retains their copyright.
Collecting the data from which the weights are created is (somewhat) manual work done by humans. You could make a good argument for this being fair use.

It all hinges on whether or not that fair use argument holds. If it does, ML companies will continue as they did. If it doesn’t, the people creating the datasets would need to license the individual works they used for the training data from the respective copyright holders.

In practice, nothing is black and white and this is still a hotly debated topic for which no clear answer exists. None of this is court-tested to my knowledge.

OTOH, there’s another legal question here: Is creating weights from training data fair use or a derivative work? If it’s fair use, that’d mean whoever creates the weights gets the copyright. In this case, that creator is a machine, meaning nearly all ML models would be public domain.

Opinion and wild speculation:

Creating weights out of training data being fair use would be …interesting but I doubt that will happen. It’s sometimes even fairly obvious that some weights are a derivative work of their training data because you can make the weights reproduce training data very closely in some cases.

I am fairly certain that model weights will be considered a derivative work of the training data, so the copyright of the weights would belong to whoever owns the copyright to the training data.

What I suspect will happen on the training data front is that the collection and tagging will (at some point) be considered a transformative action, making it fair use.

I think this way because artists do not have a lobby, so even if the judiciary decided that collecting training data wasn’t fair use, the rich tech companies will get their way because they can woo the legislature using their """AI""", creating new copyright exceptions such that aristocrat pockets can continue to be filled with peasant money.
Far more convincing is the flip side: If collecting training data weren’t fair use, that would be to the benefit of the peasants; ML companies would have to license works from individual artists and pay them license fees. We can’t have aristocrat money going into peasant pockets.