I’ve thought about this regarding code as well. An AI is nothing without a training data set, so if someone uses licensed code to train an AI, they should definitely be bound by the license. For example, if an AI is trained on copyleft-licensed code, the resulting model should also be regarded as bound by the copyleft terms. As of now, I suspect this is being ignored to a very large degree.
Sure, but that particular horse has left the barn. There will be cases where identification is easier, but as shown in Oracle v. Google, there are only so many ways to express ideas in code.
For example, I just asked Claude 2 “Write a program in C to count from 1 to some arbitrary number specified on the command line.” Can you tell me the origin of this line from the result?
for(int i=1; i<=n; i++) {
I mean, if it’s from a copyrighted work, I certainly don’t want to use it in an open-source project!
EDIT: Guessing there’s a bug in HTML entity handling.
Of course, once the AI is trained, you can’t look at some arbitrary output and determine whether that specific output resulted from some specific training data set. In principle, if some of your training data is found to violate copyrights, you either have to compensate the copyright holder or retrain the model without that data.
Finding out whether a copyrighted work is part of the training data is a matter of going through that data, and should be the responsibility of the people training the model. I would like to see a case where it has been shown that a copyrighted dataset was used to train a model, and those who violated the copyright by doing so are held responsible.
It’s not over and done with. Pass regulation saying every AI accessible within the country has to have a publicly available dataset. That way people can see if their works have been stolen or not. When we inevitably see works recreated wholesale without proper copyright, the AI creators can be sued or fined.
That way people can see if their works have been stolen or not.
Firstly, nothing at all is being “stolen.” The words you’re looking for are “copyright violation.”
Secondly, it does not currently appear that training an AI model on published material is a copyright violation. You’re going to have to point to some actual law indicating that. Currently that sort of thing is generally covered by fair use.