I’d imagine there have been more nonsensical (than AI = public domain) legal decisions that have had the full force of law for decades.
I recently dug around for a while, and if the copyright of works in the training data affects the copyright of outputs, no popular model can output anything that would even be close to acceptable for a contribution to an open-source project. Maybe if you trained a model exclusively on “The Stack” (NOT “The Pile”) and then included all the required attributions – but no ready-made model does that. All of the “open source” model frameworks that I could find included some amount of proprietary “pre-training” data that would also be an issue.
If AI output is NOT affected by the copyright of training data… there might not BE a (legal) person that can hold any copyrights over it, which is pretty close to public domain.
Good Sire, if we are talking about only the US, then that does not matter at all. Existing copyright law and established precedents (without involving AI) already cover this. The copyright of software is handled like that of literature, so the actual content is copyrighted; more specifically, the sequence of words. In order to violate the copyright of a protected work, one just has to reproduce this sequence. It is irrelevant whether it was reproduced by an AI, a human, God or your cat (:D). The only exception to this is fair use. Whether fair use applies must be decided on a case-by-case basis, and there are four factors that are weighed in making that decision. And that is assuming that no portions of that code are patented. If they are, then you are screwed no matter what (unless you are allowed to use that code).
Anyhow, you are opening yourself up to litigation for sure.
Now, is this a problem? Probably not. Copyright infringement is actually very hard to spot, especially without automated tools (looking right at you, YouTube). Even if it is spotted, the owners of the copyright must spend resources in order to enforce it. Considering that most of the code used in the training data is open-source, most of these owners won’t have these resources or at least aren’t using them (which is sad, because that also applies to infringement by companies). You cannot lose if no one sues. Whether you should risk it is your own decision to make.
For unprotected code… I guess you are right. It could go one way or the other, but it does not really matter that much. At worst, people can use your code without adhering to your license. That would not mark the end of a project, whereas the former definitely would.
On another note: using copyrighted material in the training data of AI is considered fair use.