Revealed: The Authors Whose Pirated Books Are Powering Generative AI

Pete Hahnloser@beehaw.org · 1 year ago

Hot Saucerman · edited 1 year ago

Seems like a clearly transformative work that would be covered under fair use.

People keep repeating this to me, but the thing is, I’ve seen what these things produce, and since the humans who created them can’t even seem to articulate what is going on inside the black box to produce output, it’s hard for me to be like “oh yeah, that human who can’t describe what is even going on to produce this totally transformed the work.” No, they used a tool to rip it up and shart it out and they don’t even seem to functionally know what goes on inside the tool. If you can’t actually describe the process of how it happens, the human is not the one doing anything transformative, the program is, and the program isn’t a human acting alone, it is a program made by humans with intent to make money off of what the program can do. The program doesn’t understand what it is transforming, it’s just shitting out results. How is that “transformative”?

I mean, it’s like fucking Superman 3 over here. “I didn’t steal a ton from everyone, just fractions of pennies from every transaction! No one would notice, it’s such a small amount.” When the entire document produced is made of slivers of hundreds of thousands of copyrighted works, it doesn’t strike me that any of it is original, or that calling it “Fair Use” is justified.

MagicShel@programming.dev · 1 year ago

I can explain it quite well in layman’s terms, but a rigorous scientific/mathematical explanation is indeed beyond our current understanding.

Not a single original sentence of the original work is retained in the model. It’s essentially a massive matrix (math problem) that takes input as a seed value to determine a weighted list of likely next tokens, rolls a random number to pick one, and then does it again over and over. The more text that goes into the model, the less likely it is that any given work would be infringed. Probably every previous case of fair use is less transformative, which would have implications far beyond AI.
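To make the "weighted list, roll a random number, repeat" loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `TOY_WEIGHTS` is a made-up table, and looking only at the previous token is a drastic simplification, since a real model computes its weights from the whole preceding context using learned parameters.

```python
import random

# Hypothetical toy "model": for each current token, a weighted list of
# possible next tokens. A real LLM computes these weights from learned
# parameters and the full context; this table is invented for illustration.
TOY_WEIGHTS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
}


def generate(weights, rng, max_tokens=10):
    """Repeatedly pick a next token at random, in proportion to its weight."""
    token = "<start>"
    output = []
    for _ in range(max_tokens):
        candidates = weights[token]
        # "Roll a random number" to pick one candidate from the weighted list.
        token = rng.choices(
            list(candidates), weights=list(candidates.values()), k=1
        )[0]
        if token == "<end>":
            break
        output.append(token)
    return output


if __name__ == "__main__":
    # The same seed gives the same text; a different seed may give another.
    print(" ".join(generate(TOY_WEIGHTS, random.Random(0))))
```

Note that the table stores only weights, never sentences. Whether that holds in the same way for a real model at scale is exactly what is being argued about here.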

Hot Saucerman · edit-2 1 year ago

seed value to determine a weighted list of likely next tokens

Its understanding of likely next tokens is all based on its understanding of existing, copyrighted works. This “knowledge” didn’t come from nowhere. I understand that a collage is a transformative work of art, but a human is actually involved with making that, not a human spitting garbage at a math problem, and then the math problem probabilistically calculates the thing most likely to sound like human speech, based on that corpus of previous human speech. It wouldn’t understand what to do with words if you only fed it the dictionary.

Someone shitting dumb prompts into it randomly does not make it human made, especially if they can’t understand the math of it. Still essentially the plot of Superman 3. If I steal just a little bit from everything, each slice will be so small no one will notice.

MagicShel@programming.dev · 1 year ago

I agree with everything you are saying but that still doesn’t make it infringing just because it’s machine-generated.

Sure, computers can ingest writing faster than any human, and they can write faster than any human, which certainly gives them advantages. But humans at least bring an executive vision that an AI (at least anything based on current technology) cannot duplicate.

Transformative technology can indeed be disruptive. I’m less worried about authors and more worried about copy editors. Should there be laws or rules changed to protect human creatives? Possibly. I’m not opposed to that in theory, but it would need to be carefully considered so that the solution doesn’t create bigger problems.

The objections I see are more societal issues. Stagnation of language and culture is a concern. Replacing entry level jobs so that there is no one to replace master craftsmen when they retire is another one. You raise absolutely valid concerns which I share. Actors and writers need to eat, of course, and I support the current strike and hope they come to an equitable solution.

I just don’t see how this can be considered infringement when a human could (and does) slice up a bunch of different stories to tell their own new ones just like you’re saying AI does (leaving aside whether that is a fair characterization). I don’t think that works as a tool to address these concerns. I’m not sure what the right tool is.

frog 🐸@beehaw.org · 1 year ago

Stagnation of language and culture is a concern.

I think this is a much bigger problem than a lot of the supporters of AI are willing to consider. There’s already some evidence that feeding AI-generated content into AIs makes them go a bit… strange in a way that renders their output utterly worthless. So AIs genuinely cannot create anything new on their own, only derivative works based on human-made content that is fed into them.

So, in order for AIs to progress, there still needs to be human creatives making truly original content. But if all human-made content is immediately vacuumed up into an AI, preventing the human creative from ever making a living off their work (and thus buying those pesky luxuries like food and shelter), then under the social system we have right now, humans won’t produce new creative work. They won’t be able to afford to.

Thus, the only logical solution is that if the developers of AIs want to train them on human-made works, they’re just going to have to compensate the authors, artists, etc. Otherwise the AIs will stagnate, and so will language and culture because of the threat AIs pose to the livelihoods of the people who create new language and culture. Even if humans are still creating new works, if there’s a genuine risk of it being taken by AI companies and fed into the bots, the humans will be a lot more cautious about posting their work publicly, which again leads to stagnation.

It’s almost like new technologies actually work best when the wealth they generate is distributed to everyone, not just hoarded by a few.

jarfil@beehaw.org · 1 year ago

The “knowledge” is not the copyrighted works, unless it can reproduce them in full. Shannon entropy calculations can come in handy when deciding whether a neural network of a given size is even capable of holding a copy of the works.
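A rough sketch of that kind of back-of-the-envelope check, in Python. The function names and every number fed into them are hypothetical. Two simplifying assumptions are baked in: raw parameter bits are treated as the model's storage capacity (a crude upper bound), and the entropy is computed from single-character frequencies only, which ignores context and so overstates how many bits the text really needs.

```python
import math
from collections import Counter


def entropy_bits_per_char(text):
    """Shannon entropy of the character frequencies in `text`, in bits."""
    counts = Counter(text)
    total = len(text)
    # Sum of p * log2(1 / p) over every distinct character.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def corpus_bits(n_chars, bits_per_char):
    """Estimated information content of a corpus, in bits."""
    return n_chars * bits_per_char


def model_capacity_bits(n_params, bits_per_param):
    """Crude upper bound on what the parameters could store, in bits."""
    return n_params * bits_per_param


def could_hold_verbatim_copy(n_params, bits_per_param, n_chars, bits_per_char):
    """True if the capacity bound is at least the corpus estimate."""
    return model_capacity_bits(n_params, bits_per_param) >= corpus_bits(
        n_chars, bits_per_char
    )


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    h = entropy_bits_per_char(sample)
    # Hypothetical sizes, chosen only to show the comparison.
    print(could_hold_verbatim_copy(1_000, 16, 1_000_000, h))
```

A `False` result only says the network could not hold the whole corpus at once. It does not rule out memorizing individual passages, so this is a sanity check on the "full copy" claim rather than a legal test.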

In that regard, a collage is less of a transformative work, since it fully reproduces the original works… and there is no lower bound to how much or what quality of input a human needs to add to a collage for it to be transformative, so “a human spitting garbage” sounds like a valid enough transformation.

knotthatone@lemmy.one · edited 1 year ago

Not a single original sentence of the original work is retained in the model.

Which is why I find it interesting that none of the court cases (as far as I’m aware) are challenging whether an LLM is copying anything in the first place. Granted, that’s the plaintiff’s job to prove, but there’s no need to raise a fair use defense at all if no copying occurred.