Licenses for LLM models

lily33@lemm.ee · 1 year ago

Atemu · 1 year ago

IANAL. TINLA.

The machine producing the derivative work is a thing which means it cannot have a copyright on anything. If it did anything original somehow, that work would be in the public domain.

The weights of the model, however, would likely be considered a derivative work of the training data because they were created directly from it. Thus, the copyright of the weights belongs to whoever owns the copyright to the training data.

The training data is created from thousands/millions/billions of individually copyrighted works. This would constitute a derivative work too, but there’s an escape hatch: Fair use. If the use of the original works is transformative enough, the creator of the derivative work retains their copyright.
Collecting the data on which the weights are created is (somewhat) manual work done by humans. You could make a good argument for this being fair use.

It all hinges on whether or not that fair use argument holds. If it does, ML companies will continue as they did. If it doesn’t, the people creating the datasets would have to license the individual works they used for the training data from the respective copyright holders.

In practice, nothing is black and white and this is still a hotly debated topic for which no clear answer exists. None of this is court-tested to my knowledge.

OTOH: There’s another legal question here: Is creating weights from training data fair use or a derivative work? If it’s fair use, that’d mean whoever creates the weights gets the copyright which, in this case, is a machine; meaning nearly all ML models would be public domain.

Opinion and wild speculation:

Creating weights out of training data being fair use would be …interesting but I doubt that will happen. It’s sometimes even fairly obvious that some weights are a derivative work of their training data because you can make the weights reproduce training data very closely in some cases.

I am fairly certain that model weights will be considered a derivative work of the training data; copyright of the weights belongs to whoever owns the copyright to the training data.

What I suspect will happen on the training data front is that the collection and tagging will (at some point) be considered a transformative action, making it fair use.

I think this way because artists do not have a lobby, so even if the judiciary decided that collecting training data wasn’t fair use, the rich tech companies will get their way because they can woo the legislature using their “”“AI”“”, creating new copyright exceptions such that aristocrat pockets can continue to be filled with peasant money.
Far more convincing is the contraposition: If collecting training data wasn’t fair use, that would be to the benefit of the peasants; ML companies would have to license works from individual artists and pay them license fees. We can’t have aristocrat money going into peasant pockets.

Dodecahedron December@sh.itjust.works · 1 year ago

Nope. AI work can’t even be copyrighted.
This is how licenses have always worked. Company makes thing, licenses it. Doesn’t matter really what that thing is.

But keep in mind there are models and there are weights. Models can be open sourced. Weights generally are not.

rufus@discuss.tchncs.de · 1 year ago

LLMs aren’t produced by a computer. They are produced with the help of a computer as a tool, just as a novel is typed on a computer and the computer is only a tool in the process of creating that book. You’d probably agree the author has the copyright.

It’s the same with LLMs. The companies need lots of programmers to develop the software that does the training. Scientists who figure out what numbers to multiply to get coherent text. More people to scrape as much text as possible and curate the datasets. It’s really complicated.

Sure, in the end a computer does the calculations. But that doesn’t make it the author. Your digital camera doesn’t take your photos, and your hammer doesn’t tile your roof.