I’m not a lawyer, but my understanding of a license is that it gives me permission to use/distribute something that’s otherwise legally protected. For instance, software code is protected by copyright, and FOSS licenses give me the right to distribute it under some conditions.

However, LLMs are produced by a computer, and aren’t covered by copyright. So I was hoping someone with a better understanding of the law could answer some questions for me:

  1. Is there some legal framework that protects AI models, so that I’d need a license to distribute them? What about using them, since many licenses restrict use as well?

  2. If the answer to the above is no: By mentioning, following and normalizing LLM licenses, are we essentially helping establish the principle that we do need permission from companies to use their models, and that they have the right to restrict us?

  • Atemu
    1 year ago

    IANAL. TINLA.

    The machine producing the derivative work is a thing which means it cannot have a copyright on anything. If it did anything original somehow, that work would be in the public domain.

    The weights of the model, however, would likely be considered a derivative work of the training data, because they were created directly from it. Thus, the copyright of the weights belongs to whoever owns the copyright to the training data.

    The training data is created from thousands/millions/billions of individually copyrighted works. This would constitute a derivative work too, but there’s an escape hatch: fair use. If the use of the original works is transformative enough, the creator of the derivative work retains their copyright.
    Collecting the data from which the weights are created is (somewhat) manual work done by humans. You could make a good argument for this being fair use.

    It all hinges on whether or not this is true. If it is, ML companies will continue as they did. If it isn’t, the people creating the datasets would need to license the individual works they used for the training data from the respective copyright holders.

    In practice, nothing is black and white and this is still a hotly debated topic for which no clear answer exists. None of this is court-tested to my knowledge.

    OTOH: There’s another legal question here: Is creating weights from training data fair use or a derivative work? If it’s fair use, that’d mean whoever creates the weights gets the copyright which, in this case, is a machine; meaning nearly all ML models would be public domain.


    Opinion and wild speculation:

    Creating weights out of training data being fair use would be …interesting but I doubt that will happen. It’s sometimes even fairly obvious that some weights are a derivative work of their training data because you can make the weights reproduce training data very closely in some cases.

    I am fairly certain that model weights will be considered a derivative work of the training data; copyright of the weights belongs to whoever owns the copyright to the training data.

    What I suspect will happen on the training data front is that the collection and tagging will (at some point) be considered a transformative action, making it fair use.

    I think this way because artists do not have a lobby, so even if the judiciary decided that collecting training data wasn’t fair use, the rich tech companies would get their way because they can woo the legislature using their """AI""", creating new copyright exceptions such that aristocrat pockets can continue to be filled with peasant money.
    Far more convincing is the contraposition: If collecting training data wasn’t fair use, that would be to the benefit of the peasants; ML companies would have to license works from individual artists and pay them license fees. We can’t have aristocrat money going into peasant pockets.

  • Dodecahedron December@sh.itjust.works
    1 year ago
    1. Nope. AI work can’t even be copyrighted.
    2. This is how licenses have always worked. Company makes thing, licenses it. Doesn’t matter really what that thing is.

    But keep in mind there are models and there are weights. Models can be open sourced. Weights generally are not.

  • rufus@discuss.tchncs.de
    1 year ago

    LLMs aren’t produced by a computer. They are produced with the help of a computer as a tool, just as a novel is typed on a computer and the computer is merely a tool in the process of creating that book. You’d probably agree the author has the copyright.

    It’s the same with LLMs. The companies need lots of programmers to develop the software that does the training. Scientists who figure out what numbers to multiply to get coherent text. More people to scrape as much text as possible and curate the datasets. It’s really complicated.

    Sure, in the end a computer does the calculations. But that doesn’t make it the author. Nor does your digital camera take your photos or your hammer tile your roof.