Sarah Silverman and other authors are suing OpenAI and Meta for copyright infringement, alleging that they're training their LLMs on books via Library Genesis and Z-Library

Arthur Besse · edit-2 2 years ago


Arthur Besse · 2 years ago

Seems very improbable that they scraped a pirate website with forced registration and tight daily download limits (10 books a day max?)

Huh?

https://annas-blog.org/help-seed-zlibrary-on-ipfs.html

https://libgen.rs/repository_torrent/

Moonrise2473@feddit.it · 2 years ago

The website is like that.

Still seems improbable that they committed massive piracy by specifically searching for and downloading illegal torrents.

Arthur Besse · 2 years ago

https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai says:

The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

Moonrise2473@feddit.it · 2 years ago

If Meta used an illegal source (which is extremely stupid, like using drug money to open a bank), it does not mean Google or OpenAI did the same.

the meta model is not public, probably for that reason, they just trained it with dirty data for research just to see the feasibility

For fun, I searched for the most obscure and niche recent book I could think of: 9791280546517 “Vado e tornerò da voi. Riflessioni sulla Pasqua e sulla Pentecoste”. It’s so niche that it’s impossible to find a pirated or even a legit ebook copy. Even though it was published a few months ago, Bing AI was able to produce an excerpt and even a short review.

Arthur Besse · 2 years ago

the meta model is not public, probably for that reason, they just trained it with dirty data for research just to see the feasibility

Meta’s LLaMA model actually is publicly available; they released it widely to anyone with a .edu email address, and of course it soon ended up on BitTorrent. Here is the 🧲 link (which you can also hilariously still find in this pull request, despite the DMCA takedowns they’ve sent elsewhere about it).
