I’m not a lawyer, but my understanding of a license is that it gives me permission to use/distribute something that’s otherwise legally protected. For instance, software code is protected by copyright, and FOSS licenses give me the right to distribute it under some conditions.

However, LLMs are produced by a computer, and aren’t covered by copyright. So I was hoping someone with a better understanding of the law could answer some questions for me:

  1. Is there some legal framework that protects AI models, so that I’d need a license to distribute them? How about using them, given that many licenses restrict use as well?

  2. If the answer to the above is no: By mentioning, following and normalizing LLM licenses, are we essentially helping establish the principle that we do need permission from companies to use their models, and that they have the right to restrict us?

  • Atemu
    1 year ago

    IANAL. TINLA.

    The machine producing the derivative work is a thing which means it cannot have a copyright on anything. If it did anything original somehow, that work would be in the public domain.

    The weights of the model, however, would likely be considered a derivative work of the training data because they were created directly from it. Thus, the copyright of the weights belongs to whoever owns the copyright to the training data.

    The training data is created from thousands/millions/billions of individually copyrighted works. It would constitute a derivative work too, but there’s an escape hatch: fair use. If the use of the original works is transformative enough, the creator of the derivative work retains their copyright.
    Collecting the data on which the weights are created is (somewhat) manual work done by humans. You could make a good argument for this being fair use.

    It all hinges on whether or not that fair-use argument holds. If it does, ML companies will continue as they did. If it doesn’t, the people creating the datasets would need to license the individual works they used for the training data from the respective copyright holders.

    In practice, nothing is black and white and this is still a hotly debated topic for which no clear answer exists. None of this is court-tested to my knowledge.

    OTOH, there’s another legal question here: Is creating weights from training data fair use or a derivative work? If it’s fair use, that’d mean whoever creates the weights gets the copyright. In this case that is a machine, meaning nearly all ML models would be public domain.


    Opinion and wild speculation:

    Creating weights out of training data being fair use would be …interesting, but I doubt that will happen. In some cases it’s fairly obvious that the weights are a derivative work of their training data, because you can make them reproduce training data very closely.

    I am fairly certain that model weights will be considered a derivative work of the training data; copyright of the weights belongs to whoever owns the copyright to the training data.

    What I suspect will happen on the training data front is that the collection and tagging will (at some point) be considered a transformative action, making it fair use.

    I think this way because artists do not have a lobby. Even if the judiciary decided that collecting training data wasn’t fair use, the rich tech companies would get their way because they can woo the legislature using their “AI”, creating new copyright exceptions such that aristocrat pockets can continue to be filled with peasant money.
    Far more convincing is the contraposition: If collecting training data wasn’t fair use, that would be to the benefit of the peasants; ML companies would have to license works from individual artists and pay them license fees. We can’t have aristocrat money going into peasant pockets.