Revealed: The Authors Whose Pirated Books Are Powering Generative AI

Pete Hahnloser@beehaw.org · 1 year ago

Hot Saucerman · edit-2 1 year ago

See, I thought this was well-known when Books3 was dumped online in 2020.

https://twitter.com/theshawwn/status/1320282149329784833

This is even referenced in the article.

I guess maybe people were shocked it was really “all of Bibliotik” because they couldn’t believe someone could actually manage to keep a decent share ratio on that fucking site to not get kicked off, especially while managing to download the whole corpus. /s (I don’t know this from personal experience or anything.)

In all seriousness, however, it’s been well known for a while now that these models were being trained on copyrighted books, and the companies trying to hide their faces over it are a joke.

It’s just like always, copyright is used to punish regular ass people, but when corporations trash copyright, it’s all “whoopsie doodles, can’t you just give us a cost-of-doing-business-fine and let us continue raping the public consciousness for a quick buck?” Corporations steal copyrighted material all the time, but regular ass people don’t have the money to fight it. Hiding behind Fair Use while they are using it to make a profit isn’t just a joke but a travesty and the ultimate in twisting language to corporate ends.

They may have bitten off more than they can chew here, though, possibly paving the way for a class-action lawsuit from writers and publishers.

MagicShel@programming.dev · edit-2 1 year ago

Seems like a clearly transformative work that would be covered under fair use. As an aside, I’ve been using AI as a writing assistant/solitary roleplaying GM for several years now and the quality of the prose can be quite good, but the authorship of stories is terrible and I can’t say they even do a good job of emulating a particular author’s style.

knotthatone@lemmy.one · 1 year ago

Clearly transformative only applies to the work a human has put into the process. It isn’t at all clear that an LLM would pass muster for a fair use defense, but there are court cases in progress that may try to answer that question. Ultimately, I think what it’s going to come down to is whether the training process itself and the human effort involved in training the model on copyrighted data is considered transformative enough to be fair use, or doesn’t constitute copying at all. As far as I know, none of the big cases are trying the “not a copy” defense, so we’ll have to see how this all plays out.

In any event, copyright laws are horrifically behind the times and it’s going to take new legislation sooner or later.

jarfil@beehaw.org · 1 year ago

My bet is: it’s going to depend on a case by case basis.

A large enough neural network can be used to store, and then recover, a 1:1 copy of a work… but a large enough corpus can contain more data than could ever be stored in a given size neural network, even if some fragments of the input work could be recovered… so it will depend on how big of a recoverable fragment is “big enough” to call it copyright infringement… but then again, reproducing up to a whole work is considered fair use for some purposes… but not in every country.

Copyright laws are not necessarily wrong; just remove the “until author’s death plus 70 years” coverage, go back to a more reasonable “4 years since publication”, and they make much more sense.

knotthatone@lemmy.one · 1 year ago

My bet is: it’s going to depend on a case by case basis.

Almost certainly. Getty Images has several exhibits in its suit against Stability AI, the maker of Stable Diffusion, showing the Getty watermark popping up in the model’s output as well as several images that are substantially the same as their sources. Other generative models don’t produce anything all that similar to the source material, so we’re probably going to wind up with lots of completely different and likely contradictory rulings on the matter before this gets anywhere near being sorted out legally.

Copyright laws are not necessarily wrong; just remove the “until author’s death plus 70 years” coverage, go back to a more reasonable “4 years since publication”, and they make much more sense.

The trouble with that line of thinking is that the laws are under no obligation to make sense. And the people who write and litigate those laws benefit from making them as complicated and irrational as they can get away with.

jarfil@beehaw.org · edit-2 1 year ago

In this case the Mickey Mouse Curve makes sense, just bad sense. At least the EU didn’t make it 95 years, and compromised on also 70… 🙄

MagicShel@programming.dev · 1 year ago

I agree with that. And you’re right that it’s currently in the hands of the courts. I’m not a copyright expert and I’m sure there are nuances I don’t grasp - I didn’t know fair use requires specifically human transformation if that is indeed the case. We’ll just have to see in the end whose layman’s interpretation turns out to be correct. I just enjoy the friendly, respectful collective speculation and knowledge sharing.

Hot Saucerman · edit-2 1 year ago

Seems like a clearly transformative work that would be covered under fair use.

People keep repeating this at me, but the thing is, I’ve seen what these things produce, and since the humans who created them can’t even seem to articulate what is going on inside the black box to produce output, it’s hard for me to be like “oh yeah, that human who can’t describe what is even going on to produce this totally transformed the work.” No, they used a tool to rip it up and shart it out and they don’t even seem to functionally know what goes on inside the tool. If you can’t actually describe the process of how it happens, the human is not the one doing anything transformative, the program is, and the program isn’t a human acting alone, it is a program made by humans with intent to make money off of what the program can do. The program doesn’t understand what it is transforming, it’s just shitting out results. How is that “transformative”?

I mean, it’s like fucking Superman 3 over here. “I didn’t steal a ton from everyone, just fractions of pennies from every transaction! No one would notice, it’s such a small amount.” When the entire document produced is made from slivers of hundreds of thousands of copyrighted works, it doesn’t strike me that any of it is original, nor that calling it “Fair Use” is justified.

MagicShel@programming.dev · 1 year ago

I can explain it quite well in layman’s terms, but a rigorous scientific/mathematical explanation is indeed beyond our current understanding.

Not a single original sentence of the original work is retained in the model. It’s essentially a massive matrix (math problem) that takes input as a seed value to determine a weighted list of likely next tokens, rolls a random number to pick one, and then does it again over and over. The more text that goes into the model, the less likely it is that any given work would be infringed. Probably every previous case of fair use is less transformative, which would have implications far beyond AI.
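A toy sketch of that loop, for illustration only. The `TOY_WEIGHTS` table, `next_token`, and `generate` are made-up names, and the tiny hand-written table is an assumption standing in for the massive matrix; a real LLM would compute the weights from the whole context so far rather than look them up from the previous token.

```python
import random

# Hypothetical toy "model": previous token -> weighted list of likely next tokens.
# A real LLM derives these weights from learned parameters and the full context.
TOY_WEIGHTS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "book": 0.2},
    "a": {"cat": 0.4, "dog": 0.4, "book": 0.2},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "book": {"sat": 0.1, "<end>": 0.9},
    "sat": {"<end>": 1.0},
}


def next_token(prev, rng):
    """Roll a random number to pick one candidate, in proportion to its weight."""
    candidates = TOY_WEIGHTS[prev]
    tokens = list(candidates)
    weights = list(candidates.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(start="<start>", seed=0, max_tokens=10):
    """Pick a token, feed it back in, and repeat until <end> or the length cap."""
    rng = random.Random(seed)  # fixed seed makes the dice rolls repeatable
    output = []
    token = start
    for _ in range(max_tokens):
        token = next_token(token, rng)
        if token == "<end>":
            break
        output.append(token)
    return output


if __name__ == "__main__":
    for seed in range(3):
        print(seed, " ".join(generate(seed=seed)))
```

Different seeds give different dice rolls and so possibly different sentences, while the same seed always gives the same one. None of that settles the legal question; it only shows the mechanics being described.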

Hot Saucerman · edit-2 1 year ago

seed value to determine a weighted list of likely next tokens

Its understanding of likely next tokens is all based on its understanding of existing, copyrighted works. This “knowledge” didn’t come from nowhere. I understand that a collage is a transformative work of art, but a human is actually involved with making that, not a human spitting garbage at a math problem, and then the math problem probabilistically calculates the thing most likely to sound like human speech, based on that corpus of previous human speech. It wouldn’t understand what to do with words if you only fed it the dictionary.

Someone shitting dumb prompts into it randomly does not make it human made, especially if they can’t understand the math of it. Still essentially the plot of Superman 3. If I steal just a little bit from everything, each slice will be so small no one will notice.

MagicShel@programming.dev · 1 year ago

I agree with everything you are saying but that still doesn’t make it infringing just because it’s machine-generated.

Sure, computers can ingest writing faster than any human, and they can write faster than any human, which certainly gives them advantages. But humans at least bring an executive vision that an AI (at least anything based on current technology) cannot duplicate.

Transformative technology can indeed be disruptive. I’m less worried about authors and more worried about copy editors. Should laws or rules be changed to protect human creatives? Possibly. I’m not opposed to that in theory, but it would need to be carefully considered so that the solution doesn’t create bigger problems.

The objections I see are more societal issues. Stagnation of language and culture is a concern. Replacing entry level jobs so that there is no one to replace master craftsmen when they retire is another one. You raise absolutely valid concerns which I share. Actors and writers need to eat, of course, and I support the current strike and hope they come to an equitable solution.

I just don’t see how this can be considered infringement when a human could (and does) slice up a bunch of different stories to tell their own new ones just like you’re saying AI does (leaving aside whether that is a fair characterization). I don’t think that works as a tool to address these concerns. I’m not sure what the right tool is.

frog 🐸@beehaw.org · 1 year ago

Stagnation of language and culture is a concern.

I think this is a much bigger problem than a lot of the supporters of AI are willing to consider. There’s already some evidence that feeding AI-generated content into AIs makes them go a bit… strange in a way that renders their output utterly worthless. So AIs genuinely cannot create anything new on their own, only derivative works based on human-made content that is fed into them.

So, in order for AIs to progress, there still needs to be human creatives making truly original content. But if all human-made content is immediately vacuumed up into an AI, preventing the human creative from ever making a living off their work (and thus buying those pesky luxuries like food and shelter), then under the social system we have right now, humans won’t produce new creative work. They won’t be able to afford to.

Thus, the only logical solution is that if the developers of AIs want to train them on human-made works, they’re just going to have to compensate the authors, artists, etc. Otherwise the AIs will stagnate, and so will language and culture because of the threat AIs pose to the livelihoods of the people who create new language and culture. Even if humans are still creating new works, if there’s a genuine risk of it being taken by AI companies and fed into the bots, the humans will be a lot more cautious about posting their work publicly, which again leads to stagnation.

It’s almost like new technologies actually work best when the wealth they generate is distributed to everyone, not just hoarded by a few.

jarfil@beehaw.org · 1 year ago

The “knowledge” is not the copyrighted works, unless it can reproduce them in full. Shannon’s information theory and entropy calculations can come in handy when deciding whether a given size neural network is even capable of holding a copy of the works.
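A back-of-the-envelope sketch of that kind of calculation, under loud assumptions: the function names are made up, the parameter count, bits per parameter, and corpus size in the usage lines are hypothetical placeholders, and the per-character entropy is a crude unigram estimate that ignores context (real text compresses better than this suggests).

```python
import math
from collections import Counter


def char_entropy_bits(text):
    """Empirical Shannon entropy in bits per character (unigram, ignores context)."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def corpus_information_bits(num_chars, bits_per_char):
    """Rough information content of a corpus at a given entropy per character."""
    return num_chars * bits_per_char


def model_capacity_bits(num_params, bits_per_param):
    """Raw storage upper bound: every bit of every weight used as pure memory."""
    return num_params * bits_per_param


def could_hold_everything(num_params, bits_per_param, num_chars, bits_per_char):
    """True only if the raw weight storage is at least the corpus information."""
    capacity = model_capacity_bits(num_params, bits_per_param)
    needed = corpus_information_bits(num_chars, bits_per_char)
    return capacity >= needed


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print("bits per char (sample):", round(char_entropy_bits(sample), 2))
    # Hypothetical numbers: 7e9 parameters at 16 bits, 1e12 characters at 2 bits each.
    print(could_hold_everything(7e9, 16, 1e12, 2.0))
```

A `False` here only says the model could not hold a copy of everything at once. It says nothing about whether particular fragments can still be recovered, which is the part a court would care about.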

In that regard, a collage is less of a transformative work, since it fully reproduces the original works… and there is no lower bound to how much or what quality of input a human needs to add to a collage for it to be transformative, so “a human spitting garbage” sounds like a valid enough transformation.

knotthatone@lemmy.one · edit-2 1 year ago

Not a single original sentence of the original work is retained in the model.

Which is why I find it interesting that none of the court cases (as far as I’m aware) are challenging whether an LLM is copying anything in the first place. Granted, that’s the plaintiff’s job to prove, but there’s no need to raise a fair use defense at all if no copying occurred.

frog 🐸@beehaw.org · 1 year ago

The one use I’ve found for using AI is getting it to prompt me. I’d found myself between stories, unable to settle on an idea, but I had a rough idea of the kind of thing I was looking for, mostly determined by going down human-made prompts and going “nope, nope, nope, that’s crap, that’s boring, that’s idiotic, nope… FFS why isn’t there anything with X, Y, and Z?”

So off I went to ChatGPT and said “give me a writing prompt with X, Y, and Z”. What I got were some ideas that were okay, in that my response to them was more “okay, yeah, that’s better than the output of r/WritingPrompts, but that plotline is still pretty derivative. Meh.”

And then something happened a couple days later. Something clicked and an actual good idea came to me, one that I felt was actually worth developing further.

I absolutely would not want ChatGPT to do any writing for me. Not only would the end results be really derivative, but that’s just not any fun. But there was definitely something useful in the process of asking it to echo X, Y, and Z at me so that I could refine my own ideas.

Beej Jorgensen@lemmy.sdf.org · 1 year ago

I also struggle to see how authors are actually harmed by this use, which might be problematic for them in court.

Hot Saucerman · edit-2 1 year ago

So a tool that will be used to phase out human writers and will further devalue their pay was trained on the writing of the people whose work it will devalue… And you don’t see how it will hurt human writers or why they might be upset that they’ll lose their job/get paid less compared to a machine that copies their past work??

AI use is literally a sticking point in the Hollywood writers strike. Hollywood already wants to devalue writers with these tools. This isn’t hypothetical. It is literally weaponizing their own labor against them.

Beej Jorgensen@lemmy.sdf.org · 1 year ago

I’m talking about using copyrighted material to train AI; you’re talking about using AI to replace authors, which is a separate, related issue.

If someone uses Stephen King’s books to train an AI, how many sales of those books are lost? Because it kinda looks like “zero” since the AI isn’t replacing those books.

blindsight@beehaw.org · 1 year ago

I think it’s two sides of the same point; the downstream effect of LLMs is devaluing writing, and it’s trained on copyrighted works.

So, for instance, if you train an LLM on everything written by Stephen King, then ask the LLM to generate stories “in the style of Stephen King”, then you could potentially create verbatim text from his books (probabilistically, it’s bound to happen with the way an LLM chains words) and/or create books similar enough to his style to be direct competition to his writing.

It’s up to the courts to decide if that argument has any legal weight, and legislators (and the public voting for them) to decide if the laws should change.

And, based on the mess that is Bill C18 in Canada, I have absolutely no confidence in new copyright laws having a lick of sense.

Beej Jorgensen@lemmy.sdf.org · 1 year ago

If it generates verbatim output, then we have a good old copyright violation, which courts could latch onto for standing.

But if I hire people to write books in the style of Stephen King and then train an AI with them, where’s King’s recourse?

And the AI could be trained on public domain data and still be a competitor to authors. It seems like the plaintiffs would have to be equally against this usage if they’re worried about their jobs.

But in those two cases, I don’t think any laws are broken.

I just think, aside from a plain old piracy violation, it’s going to be a tricky one in court. Sure you can’t just copy the book, but running a copy of a book through an algorithm is tougher to ban, and it’s not something that necessarily should be illegal.