Training is not Copying

The Government of India has proposed a hybrid model to address the copyright concerns around generative AI. As much as the government needs to be commended for thinking outside the box, the solution it has come up with is fatally flawed.
This article first appeared in the Mint.
In my Ex Machina article last week, I argued that the working paper issued by the Department for Promotion of Industry and Internal Trade (DPIIT) on copyright and artificial intelligence (AI) falls short of its objective because the mandatory blanket licensing regime it proposes transfers wealth away from the very creators it was supposed to protect.
But as bad as this suggestion is, it is not the report’s most egregious conceptual shortcoming. Far more disconcerting are the assumptions the report makes about how AI systems are trained and its suggestion that this process violates the Copyright Act, 1957.
Does AI Copy?
The verb ‘copy’ lies at the heart of many of the operational activities around which copyright law has been designed. A ‘copy’ has always referred to a reproduction of a work clearly identifiable as having been substantially derived from the original. The law, however, has never treated the act of learning from a work as equivalent to reproducing it.
In the early days, copies referred to physical reproductions made by a printing press or other mechanical devices designed for this purpose. Since then, the notion has been extended to cover the many digital duplicates we encounter today, most of which will never physically exist. It is this concept that is now being stretched to cover AI training.
For the training of an AI model to qualify as copyright infringement, it must be established that the process results in the creation and storage of reproductions of copyrighted works in a form that is intelligible, expressive and capable of substituting for the originals. This, however, is not how models are trained.
The Training Process
When an AI model trains on a given corpus of text, it converts that training data into vectors: strings of numbers that form coordinates in a high-dimensional space. This allows the concepts contained in the text to be represented as distinct points in that space, so that the relationships between them can be mathematically encoded in terms of the distance and direction separating them.
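A minimal sketch may help make this concrete. The three-dimensional vectors below are invented for illustration; real models learn embeddings with hundreds or thousands of dimensions, and none of these numbers come from an actual system.

```python
# Illustrative only: toy 3-dimensional vectors standing in for learned
# embeddings. The words and numbers are invented for this example.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.5, 0.8, 0.0]),
    "woman": np.array([0.5, 0.2, 0.0]),
}

def cosine_similarity(a, b):
    """How closely two directions align (1.0 means identical direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Relationships are encoded as distance and direction: in this toy space the
# offset from "man" to "king" points the same way as the offset from "woman"
# to "queen", capturing an analogy without storing a single sentence of text.
print(cosine_similarity(vectors["king"] - vectors["man"],
                        vectors["queen"] - vectors["woman"]))  # ~1.0
```

What the model ends up holding, in other words, is geometry, not prose.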
As a result, training strips away the original expression of an author’s prose (the rhythm of sentences, the choice of specific words, the ordering of paragraphs) to reveal the abstract concepts that the AI model requires, in much the same way a human learns from a text when reading it. The model then uses these concepts to reduce the uncertainty of its next-token predictions, thereby improving its ability to respond to prompts.
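To see why this leaves nothing of the original text behind, consider a deliberately simplified sketch: a toy bigram “model” that counts which word follows which, converts those counts into probabilities and then discards the text. This is a caricature of next-token prediction, not how any production system is built, and the corpus is invented for illustration.

```python
# Illustrative only: a toy bigram "model" that keeps next-token statistics
# and nothing else. Real systems are vastly more sophisticated, but the
# point stands: what is stored is numbers, not text.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()  # invented text

# Count how often each token follows each other token.
counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    counts[current][following] += 1

# Convert counts into next-token probabilities; this numeric table is the
# entirety of what the "model" retains.
model = {
    token: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
    for token, followers in counts.items()
}

del corpus  # the training text is gone; only the statistics remain

print(model["the"])  # {'cat': 0.666..., 'mat': 0.333...}
```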
While stripping out expression may not have been its express purpose, this training process embodies a workflow that falls outside the scope of what copyright law seeks to protect. Indian courts have consistently held that there can be no copyright in an idea; only in the form, manner, arrangement or expression of it.
As we have seen, the training process strips out those elements of the work that are entitled to copyright protection, and, as a result, there is no question of copyright violation during training.
No ‘copies’ are stored: not of a book’s plot, nor snippets of its text, let alone the book as a whole. Any copies that may be generated along the way are transient, unintelligible and non-expressive. All the model retains are the ideas and concepts expressed through the work, and these are precisely the elements to which, as Indian courts have held, copyright protection does not extend.
To be clear, the DPIIT paper was motivated by legitimate concerns about the impact of AI on creative industries. The many authors, artists and journalists whose works are being used to train AI models worry that these AI systems will be able to produce content that rivals their own, often in a fraction of the time it takes them to generate similar outputs.
This, in turn, makes them fear for their livelihoods and their continued ability to eke out an existence in a future where AI democratizes creativity to such an extent that the skills they have accumulated over a lifetime become replaceable.
Focus on the Outputs
As much as these concerns merit serious consideration, if we are looking to apply copyright law in order to find a solution, our scope of operation will be limited to what that law is capable of protecting. That being the case, the DPIIT’s approach of alleging copyright violation during the training process is unlikely to succeed, as it rests on a poor understanding of how AI models are trained.
What would be far more effective is to focus on the other end of the workflow: the outputs that these models generate. If it can be shown that an AI system, in response to a prompt, has reproduced a substantial portion of a copyrighted work, that output would constitute a substantial copy of the author’s work. Since no permission was obtained to generate such an output, the response could amount to copyright infringement and entitle the author to appropriate legal remedies.
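For what it is worth, an output-side check of this kind is straightforward to prototype. The sketch below is an illustration, not a legal test: it flags long verbatim runs shared between a model’s output and a protected text, with both texts invented and the threshold at which an overlap becomes “substantial” left, as it must be, to the law.

```python
# Illustrative only: measure the longest verbatim run of words shared
# between a model's output and a protected work. Both texts are invented.
def longest_shared_run(output_words, work_words):
    """Length of the longest run of consecutive words common to both texts."""
    best = 0
    for i in range(len(output_words)):
        for j in range(len(work_words)):
            k = 0
            while (i + k < len(output_words) and j + k < len(work_words)
                   and output_words[i + k] == work_words[j + k]):
                k += 1
            best = max(best, k)
    return best

work = "it was the best of times it was the worst of times".split()
output = "the review said it was the best of times for indian cinema".split()
print(longest_shared_run(output, work))  # 6 consecutive shared words
```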
These remedies should be available under the existing copyright law, but to the extent that the DPIIT feels that this is not abundantly clear, it could suggest amendments that more effectively protect the rights of authors under these circumstances.
What the DPIIT should refrain from doing, at all costs, is extending the scope of copyright to the AI training cycle. After all, should the department do so, it will become hard to explain why the same logic should not apply to human learning as well, an outcome copyright law has always been careful to avoid.
