When GitHub Copilot launched in 2021, the fact that its training data included vast amounts of Open Source code publicly available on GitHub attracted significant attention and sparked lively debate about licensing. While conditions such as the attribution required by most licenses were at issue, a particularly large share of the discussion held that the conditions of copyleft licenses, such as the GNU General Public License (GNU GPL), would propagate to the model itself, requiring the entire model to be released under the same license. GPL propagation is a concept that most modern software engineers have absorbed as a matter of course; for an engineer of straightforward sensibility, it is a natural leap to conclude that if GPL code is included in some form, copyleft applies and the license propagates.
However, as of 2025, the theory that the license of the source code propagates to AI models trained on Open Source code is heard far less often than it was back then. Some ardent believers in software freedom still advocate it, but their voices seem to be drowned out by the benefits of AI coding, which has thoroughly permeated the programming field. Amid this trend, even I sometimes succumb to the illusion that the theory never existed in the first place.
Has the theory that the license of training code propagates to such AI models been completely refuted?
Actually, it has not. The issue remains unsettled: lawsuits are still ongoing, and no major government has issued a clear judgment. In this article, I explain the current state of this license propagation theory, namely “GPL propagates to AI models trained on GPL code,” and connect it to broader questions such as the legal positioning of models and the nature of the freedom we pursue in the AI domain.
- The Current Standing in Two Lawsuits
- Arguments Negating the Theory of License Propagation to Models
- The Stance of OSI and FSF
- Summary
- References
Note: This article is an English translation of a post originally written in Japanese. While it assumes a Japanese reader, I believe it may also be useful for an English-speaking audience.
The Current Standing in Two Lawsuits
First, let us organize what the “GPL propagation theory for AI models” entails. This is the idea that when an AI model ingests GPL code as training data, the model itself constitutes a derivative work of the GPL code; therefore, when the model is distributed, the copyleft conditions of the GPL, such as the obligation to disclose source code, apply. In other words, the question is not whether the model’s output resembles the GPL code; the theory holds that “since the model itself is a derivative containing GPL code, the GPL extends to the model.” While this theory had many supporters around 2021, as mentioned earlier, it is no longer the mainstream of the discussion. However, two major ongoing lawsuits can be cited as evidence that it has not been definitively rejected: Doe v. GitHub (the Copilot class action) filed in the United States, and GEMA v. OpenAI filed in Germany. I explain the history and current status of each below.
Doe v. GitHub (Copilot Class Action): The Persisting Claim of Open Source License Violation
In the Copilot class action filed at the end of 2022 over GitHub Copilot, anonymous developers, as plaintiffs, argued that GitHub, Microsoft, and OpenAI trained their models on source code from public repositories without permission and invited massive license violations through Copilot. Specifically, they objected that when Copilot reproduces portions of training-source code in its output, it provides none of the attribution or copyright notices required by licenses such as MIT or Apache-2.0, and that it indiscriminately trains on and outputs code under copyleft licenses like the GPL, trampling on their terms. The plaintiffs framed this as a breach of the open source licenses as contracts, and also sought damages and injunctions on the theory that it violated the Digital Millennium Copyright Act (DMCA).
In this case, the United States District Court for the Northern District of California has already handed down several decisions, dismissing many of the plaintiffs’ claims. Those dismissed were mainly peripheral: DMCA provisions, privacy policy violations, unjust enrichment, and torts. However, part of the DMCA claims and the claim of “violation of open source licenses” (breach of contract) remain alive. On the latter, the argument is that although the plaintiffs published their code under licenses like the GPL or MIT, the defendants complied with neither the attribution requirements nor the obligation to publish derivatives under the same license, which constitutes breach of contract. The court declined to recognize monetary damages because the plaintiffs could not demonstrate a specific amount of harm, but it found sufficient grounds for injunctive relief against the license violation itself. As a result, the plaintiffs may continue the lawsuit seeking an order prohibiting Copilot from reproducing others’ code without appropriate license notices.
As this history makes clear, “violation of open source licenses in training data” is still being contested in court in the Copilot litigation, and this is one reason the theory of license propagation to models has not been completely denied. The plaintiffs do not directly demand that the model itself be released under the GPL, but they legally pursue the point that license conditions were ignored during training and output; this implies that “if the handling does not follow the license of the training data, the act of providing the model could be unlawful.” The court has not clearly rejected this logic at this stage, and has indicated that using open source code carries license obligations, and that providing tools which ignore them could constitute a wrong subject to injunction.
Note, however, that the claims in the Copilot litigation are legally framed as breach of contract (license) or DMCA violation; they are not a direct copyright argument that “the model is a derivative work of GPL code.” No ruling has gone so far as to mandate disclosure of the entire model under the GPL. The actual decisions are conservative, saying in effect “monetary damages have not been shown, but there is room for future injunctive relief,” and do not touch on any obligation to disclose the model itself. In other words, there is still no judicial precedent directly addressing the “GPL propagation theory for models”; what remains alive in the courts is the issue of license violations concerning the source code.
GEMA v. OpenAI: The Theory Treating “Memory” in Models as Legal Reproduction
The other important lawsuit is the case in which the German music rights collecting society GEMA sued OpenAI. This is a copyright suit over an AI model’s unauthorized training on and output of song lyrics, not AI code generation, but even though the GPL is not directly at stake, it carries significant theoretical implications for “license propagation to models.”
In November 2025, the Munich I Regional Court handed down its judgment, holding, with respect to ChatGPT models that had memorized and reproduced the lyrics of nine famous German songs, that the “memorization” inside the model itself constitutes reproduction under copyright law. According to the judgment, the lyrics managed by the plaintiff were “fixed” in the GPT-4 and GPT-4o models, to the point that a simple user prompt caused them to be output almost verbatim. On that basis, the court found that the model internally contains “parameters that have memorized the work,” and that if an appropriate prompt can make the model reproduce, for a human, an expression substantially identical to the original, that memorization itself falls under “reproduction” in Article 16 of the German Copyright Act. It further held that actually outputting the lyrics in response to a prompt is a separate act of reproduction, and that providing the lyrics to the user constitutes making the work available to the public (public communication). Since all of this occurred without the rights holder’s permission, the court ruled that it exceeds the scope justified by the TDM (text and data mining) exception in the EU DSM Copyright Directive.
The important point of this judgment is its clear acknowledgment that “if a work is recorded inside the model in a reproducible form, that state itself can constitute copyright infringement.” The court cited the EU InfoSoc Directive’s language that reproduction includes copies “in any form or manner” and need not be directly perceptible to humans, and reasoned in that spirit that even lyrics encoded within the model’s parameters amount to the creation of a reproduction. It went so far as to say that “encoding in the form of probabilistic weights does not prevent something from being considered a copy,” a strong statement that differences of technical format cannot escape the copyright-law concept of reproduction. Moreover, because the model output the lyrics not coincidentally but highly consistently, the court found as a matter of fact that “direct incorporation of the essential part of the training data” had occurred, rather than a mere result of statistical learning. The Munich Regional Court accordingly recognized OpenAI’s liability for injunction and damages over the lyric outputs, and further ordered it to provide information about training data and output content going forward. This is, however, a first-instance judgment, and since OpenAI has indicated it will appeal, the dispute is expected to continue.
The noteworthy doctrine in the GEMA judgment is the extension of the copyright-law concept of reproduction into the interior of the model. That is, if a work used as training data remains within the model and can be reproduced by a simple operation, the model already contains a reproduction of that work. This is groundbreaking in deeming that “the model contains the source work”; indeed, a commentary by Osborne Clarke observes that “in contrast to the English High Court’s judgment in Getty v. Stability AI, the Munich Regional Court explicitly recognized the possibility that an AI model contains copies of the training material.” On this view, the model is not merely a result of analysis but can, depending on the case, be treated as an aggregate of the training data itself.
However, keep in mind that this judgment rests on an extreme case: near-complete verbatim output of short texts, namely lyrics. The court itself emphasized that the holding is limited to cases where the model performs complete reproduction, stating in effect that “normally, temporary reproduction for training stays within the purpose of analysis and does not harm the rights holder’s market, but here the model holds the work in a restorable form and exceeds the scope of analysis.” And as the UK case shows, judicial decisions vary by country; no legal consensus on this issue has yet formed.
Nevertheless, a judgment declaring that the recording of a work inside a model is reproduction can become a major pillar supporting the license propagation theory. The premise for discussing GPL propagation is precisely “whether the model can be said to be a reproduction or derivative work of the GPL code,” and the Munich court’s logic legally affirmed exactly that “a model can be a reproduction of its training data.”
Possibilities Derived from the Current Status of the Two Lawsuits
From the two lawsuits above, we can consider the path through which the theory of license propagation to AI models might be recognized in the future.
Assume the worst-case scenario from the perspective of AI operators: both lawsuits end in final judgments for the plaintiffs. The Copilot litigation would establish that “model providers must comply with the license conditions of the training source code,” and the GEMA litigation would establish the legal principle that “the model encompasses reproductions of the work.” Where the two intersect, the conclusion follows in theory that “since an AI model containing GPL code is a reproduction or derivative work of that code, the conditions of the GPL apply directly to its provision.” That is, the possibility emerges that the judiciary effectively ratifies the theory of GPL propagation to models.
Concretely, if a model has memorized GPL code fragments internally, distributing or providing that model to a third party might be regarded as distributing a reproduction of GPL code, in which case distribution under any terms other than the GPL would be evaluated as a GPL violation. If a violation were established, there would be room to argue for remedies just as with ordinary software: injunctions, damages, and forced GPL compliance demanding disclosure of the entire model under the same license. In fact, the remedies GEMA sought from OpenAI included disclosure of training data and output content; although this arose in the context of musical works, it is a type of disclosure request aimed at making transparent “what the model learned and contains.” In a GPL-violation case, one cannot rule out demands such as “disclosure of the GPL code parts contained inside the model” or “source disclosure in a form that allows the model to be reconstructed” as part of seeking license compliance.
Even short of such an extreme conclusion, an intermediate scenario could impose certain restrictions on model providers. For example, the Copilot litigation might be resolved, by settlement or judgment, through measures such as attaching a license and attribution at output time whenever generated code includes existing code above a certain length, or technically mandating filters so that GPL code fragments cannot be extracted or reproduced from the model. In fact GitHub, Copilot’s developer, has already introduced an optional feature that excludes a suggestion when the candidate code matches existing code in large public repositories, an attempt to reduce litigation risk. As for OpenAI, there are reports that it strengthened its filters so that ChatGPT does not output copyrighted lyrics verbatim, in response to the GEMA judgment.
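To make the idea of such an output-side filter concrete, here is a minimal sketch of one way it could work in principle: index token n-grams of known public code, then suppress any suggestion that shares a long verbatim span with the index. GitHub has not published Copilot’s actual mechanism, so everything below (whitespace tokenization, the window length, the function names, the sample strings) is an illustrative assumption, not a description of the real filter.

```python
# A minimal sketch of an output-side duplicate filter. Not GitHub's actual
# implementation (which is unpublished); names and parameters are assumptions.

def ngrams(tokens, n):
    """Yield every contiguous n-token window of a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def build_index(known_snippets, n):
    """Index every n-gram appearing in a corpus of known public code."""
    index = set()
    for snippet in known_snippets:
        index.update(ngrams(snippet.split(), n))
    return index

def is_blocked(suggestion, index, n):
    """Suppress a suggestion if any n-token window matches indexed code."""
    return any(g in index for g in ngrams(suggestion.split(), n))

# Tiny demo with an aggressively short window of 4 tokens; a production
# filter would use far longer spans and a real tokenizer.
snippet = "for i in range(10): total += values[i]"
idx = build_index([snippet], n=4)
print(is_blocked("x = 0\n" + snippet, idx, n=4))  # True: verbatim overlap
```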
Legally, these measures are not license propagation itself, but in practice they show the industry steering toward “ensuring the model does not even potentially infringe license conditions.” Going forward, we may see guidelines for excluding data under specific license terms such as the GPL at the training stage, and mechanisms for post-training output inspection to guarantee that no license-infringing output occurs.
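On the training-data side, such exclusion could in the simplest case be a filter over declared license metadata. The sketch below assumes each candidate repository carries an SPDX license identifier; a real pipeline would also need license detection, per-file and multi-license handling, and a policy for missing metadata, none of which is shown here.

```python
# A minimal sketch of training-stage license filtering, under the assumption
# that every candidate repository carries a declared SPDX identifier.

# Common copyleft SPDX identifiers to exclude (illustrative, not exhaustive).
COPYLEFT = {
    "GPL-2.0-only", "GPL-2.0-or-later",
    "GPL-3.0-only", "GPL-3.0-or-later",
    "AGPL-3.0-only", "AGPL-3.0-or-later",
    "LGPL-3.0-only",
}

def filter_training_repos(repos):
    """Keep only (name, spdx_id) pairs whose license is not copyleft."""
    return [(name, lic) for name, lic in repos if lic not in COPYLEFT]

# Hypothetical input data for illustration.
repos = [
    ("fastlib", "MIT"),
    ("kernelish", "GPL-2.0-only"),
    ("webthing", "Apache-2.0"),
]
print(filter_training_repos(repos))
# [('fastlib', 'MIT'), ('webthing', 'Apache-2.0')]
```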
In any case, until these two lawsuits are finally resolved and any legislative response takes shape, the “theory of GPL propagation to models” will not completely disappear. It is a scenario that could suddenly become realistic depending on future judgments, and even if the plaintiffs lose, support for the theory could reignite within the open source community. Note that while it is currently an “undetermined theory no longer shouted as loudly as before,” that does not mean it has been legally denied and resolved. Our community needs to consider countermeasures carefully while watching these developments, taking into account each country’s legal system and the counterarguments described in the latter half of this article.
Treatment under Japanese Law
Building on the overseas lawsuits above, let me also organize how Japanese law treats the relationship between AI models, copyrighted works, and licenses. In Japan, Article 30-4 of the Copyright Act, introduced by the 2018 amendment, comprehensively legalizes acts of reproduction associated with machine learning. Furthermore, in March 2024, the Copyright Division of the Council for Cultural Affairs under the Agency for Cultural Affairs published a guideline-like document titled “Thought on AI and Copyright” (hereinafter “the Thought”), presenting a legal analysis divided into the development/training stage and the generation/utilization stage of generative AI.
According to “the Thought,” reproduction performed for the purpose of AI training is in principle lawful as long as it qualifies as “information analysis not aimed at enjoying the thoughts or sentiments expressed in the work” under Article 30-4. Collecting and reproducing a wide range of data from the internet to build a training dataset for research and development can therefore, in principle, be done without the rights holders’ permission. What matters, however, is whether a “purpose of enjoyment” is mixed into the training act. “The Thought” states that training conducted with the aim of “intentionally reproducing all or part of the creative expression of a specific work in the training data as the output of generative AI” is evaluated as having a concurrent purpose of enjoying the work, not mere information analysis, and thus falls outside Article 30-4. The typical example cited is “overfitting”: deliberately making a model memorize a specific group of works through additional training so that it outputs something similar to them is judged to involve a purpose of enjoyment.
Furthermore, “the Thought” also addresses the legal status of trained models, stating first that “in many cases, trained models created by AI training cannot be said to be reproductions of the works used for training.” The reasoning is that because a general-purpose model can generate outputs unrelated to the originals in response to all sorts of inputs, the model itself is not a copy of any specific work.
At the same time, “the Thought” acknowledges that, exceptionally, where “the trained model is in a state of generating, with high frequency, products similar to a work in its training data,” the creative expression of the original may be deemed to remain in the model, which may then be evaluated as a reproduction. It also notes that in such cases the model is positioned as a machine for copyright infringement, and an injunction may be available. In short: usually the model is mere statistical data, not the work itself, but if it has turned into a device for spewing out specific works almost verbatim, it can be treated as an infringing article. This reasoning partly overlaps with the GEMA judgment.
Note that the above analysis strictly concerns the scope of the rights-limitation (exception) provisions of the Copyright Act and says nothing about the validity of contracts or license clauses. The Agency for Cultural Affairs document discusses only “whether there is copyright infringement”; it does not deny that even a lawful training act may give rise to contractual liability if it separately violates terms of service or open source licenses. Nor does it offer any in-depth view on the propagation of copyleft clauses like the GPL’s. Japan’s Copyright Act has no override provision making rights-limitation rules like Article 30-4 prevail over contract terms, and the Ministry of Economy, Trade and Industry’s “Contract Guidelines on Utilization of AI and Data” suggests that where a contract between the parties prohibits data use, the contract may take precedence.
Therefore, if the license is treated as a valid contract, the risk remains that training which is “lawful” under Article 30-4 of the Copyright Act is nevertheless a “violation of license conditions” under contract law; at the very least, no official view has organized the theory of GPL propagation to models. In other words, while the Copyright Act currently recognizes the lawfulness of model training quite broadly, license violations are left to general civil-law doctrine, and there is no clear guidance on, say, “whether publicly distributing a model trained on GPL code violates the GPL.” Overall, Japan’s legal landscape is “safe in principle at the copyright layer, but blank at the contract layer.” Discussion of GPL propagation to models in Japan thus depends on future judicial decisions and legislative developments, and for now one can only set operational guidelines by carefully following the Agency for Cultural Affairs’ analysis.
Arguments Negating the Theory of License Propagation to Models
As the previous sections show, the theory of GPL propagation to models is not legally dead. However, many legal experts and engineers point out that the theory carries serious harms. Here I present representative counterarguments at four layers: copyright law, the GPL text, technology, and practical policy.
Arguments for Negation at the Copyright Law Layer
First, under copyright law it is unreasonable to regard an AI model as a “derivative work” or “reproduction” of the works it was trained on. In most cases, the expression of any specific work is not stored inside the model in a form recognizable to humans. The model holds only statistical abstractions, text and code converted into weight parameters, and to a human that is not creative expression at all. A “derivative work” under copyright law is a creation that incorporates the essential expressive features of the original in a directly perceivable form, but one cannot directly perceive the creativity of the original code in the model’s weights. In other words, the model does not exhibit the character of a work directly enough to be evaluated as encompassing the original code. For example, the High Court of Justice in England held in Getty v. Stability AI that the Stable Diffusion model itself is not an infringing copy of the training images, a negative view of treating the model itself as a reproduction of works. Internationally, then, many take a cautious position on treating the model itself as an accumulation of works or a compilation.
Moreover, the model’s outputs result from probabilistic, statistical transformation, and in most cases bear no resemblance to the training sources. Even when a match or similarity occurs, it is difficult to prove whether it is a reproduction relying on the original or an accidental resemblance. Carrying out the findings of reliance and similarity that copyright infringement requires, for the entire model, is not realistic. Ultimately, within the framework of copyright law, “whether the model relies on a specific work” has to be judged work by work, and ascribing uniform work-status or infringing character to the model itself is a large leap. As with the Japanese analysis that the model is in most cases not a reproduction, the model-equals-work schema is unreasonable under copyright law.
Arguments for Negation at the GPL Text Layer
Next, the license text and intent of the GPL itself cast doubt on the interpretation that the GPL propagates to AI models. In the text of GPLv2, for example, copyleft applies to “derivative works” of the code provided under the GPL and to “works that contain the Program.” This has typically been read to mean software created by modifying or incorporating GPL code, or software combined (linked) with it. For an AI model, it is extremely unclear which part of the original GPL code the model “contains.” Even if the model has memorized fragments of the GPL code it was trained on, those are a tiny fraction of the whole; the vast majority of the parameters are unrelated to the GPL code. The GPL’s drafters gave no clear indication of whether a statistical model that may partially encapsulate information derived from GPL code counts as “a work containing the Program.”
Furthermore, the GPL requires that source code be provided as “the preferred form of the work for making modifications to it.” If an AI model were a GPL derivative, what would that preferred form be? The model weights themselves have little readability or editability for humans, and are hard to call a preferred form for modification. If one answers that the training data is the source, then neither the trained GPL code by itself nor the entire vast, heterogeneous training dataset clearly qualifies as “the source of the model.” It is difficult to define what would have to be disclosed to redistribute the model in GPL compliance, and the logic could spiral to the extreme conclusion that all code and data used in training must be disclosed. Some software-freedom advocates do aim for exactly that, but it is unrealistic in practice, and it strays from the GPL’s purpose of enabling users to modify and rebuild from source. Existing GPL provisions simply were not designed to cover artifacts like AI models, and forcing them onto models causes friction in both text and operation.
In fact, the “Open Source AI Definition” that the OSI (Open Source Initiative) finalized in 2024 stops, with respect to the “information necessary for modification” of a model, at requiring disclosure of sufficiently detailed information about the training data; it does not require provision of the training data itself in its entirety. It does state that model weights and training code should be published under OSI-approved licenses.
In addition, the FSF (Free Software Foundation) itself does not believe that current GPL interpretation alone can guarantee freedom in the AI domain, and announced in 2024 that it had begun formulating “criteria for machine learning applications to be free.” The direction indicated there is that the four freedoms should be guaranteed to users not only for the software but also for the raw training data and model parameters, which is, conversely, an admission that current licenses do not guarantee this. The FSF also points out that since model parameters cannot be called source comprehensible to humans, modification through retraining is more realistic than direct editing, a cautious attitude toward treating models as an extension of the existing GPL. In sum, flatly asserting GPL propagation to AI models, which fall outside the wording and assumptions of the GPL’s provisions, is unreasonable as a matter of interpretation.
Arguments for Negation at the Technical Layer
There are also strong counterarguments from the technical side. AI models, and large language models in particular, fundamentally hold enormous statistical regularities internally; they do not store the original code or text verbatim like a database. Returning a particular output for a particular input is just generation according to a probability distribution, with no guarantee that output identical to the training data will be obtained. If, outside a very small number of exceptional cases, the model does not reproduce training data verbatim, then characterizing the model as “containing GPL code” does not fit the technical reality. Indeed, OpenAI argued in the GEMA lawsuit that the model does not memorize individual training examples but merely reflects, in its parameters, knowledge learned from the dataset as a whole. The Munich court rejected that argument, but only because there were clear examples of lyric reproduction; absent such clear examples, the view that “the model is a mass of statistical knowledge” would stand.
Furthermore, while it is confirmed that models can emit fragments of training data, the proportion is considered extremely small relative to the whole. Treating the whole as a reproduction because partial memorization exists is an overgeneralization, like calling an entire image a reproduction of a photograph because it contains one tiny mosaic-like fragment of it. Technically, it is difficult to measure quantitatively how much influence of the original data particular parameters retain; the correspondence between model and training data remains statistical, with no clean place to draw a line. A criterion such as “how similar must it be for the GPL to propagate?” therefore cannot even be formulated. Infringement can only be judged output by output, which is inconsistent with applying a single license to the entire model. From the technical standpoint, the model is fundamentally a statistical transformation whose overwhelming majority is unrelated to GPL code, so blanket application of the GPL is irrational.
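For a sense of what “judging output by output” looks like in practice, memorization studies typically measure the longest verbatim token span an output shares with the training corpus. Large-scale work does this with suffix arrays over the whole corpus; the deliberately naive quadratic sketch below only illustrates the idea, and the sample strings are made up.

```python
# A naive sketch of a per-output memorization measurement: the longest
# verbatim token run shared between a model output and a training document.
# Real studies use suffix arrays at corpus scale; this is illustration only.

def longest_shared_span(output_tokens, train_tokens):
    """Length of the longest contiguous token run present in both lists."""
    best = 0
    for i in range(len(output_tokens)):
        for j in range(len(train_tokens)):
            k = 0
            while (i + k < len(output_tokens)
                   and j + k < len(train_tokens)
                   and output_tokens[i + k] == train_tokens[j + k]):
                k += 1
            best = max(best, k)
    return best

train = "static int parse_flags ( const char * s )".split()
out = "int parse_flags ( const char * arg )".split()
print(longest_shared_span(out, train))  # 6: "int parse_flags ( const char *"
```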
Practical and Policy Arguments for Negation
Finally, there are major practical and policy objections. What would happen if the GPL propagation theory were legally accepted? As an extreme example, if one million code repositories were used to train a large model, every license among them (GPL, MIT, Apache, proprietary, and so on) would “propagate” to the model, and the provider would have to distribute it in a form complying with all one million sets of terms. In practice, some conditions contradict each other, GPLv2 and Apache-2.0 being a classic pair, and attaching and managing a huge collection of copyright notices for a single model is simply unrealistic. Applying every license to an AI model trained on license-mixed data is a practical dead end; in the end, the only way to avoid it would be to exclude copyleft-licensed code like GPL code from the training data from the start.
Is such a situation really desirable for our community? The spirit of the GPL is to promote the free sharing and development of software. If asserting sweeping propagation to AI models drives companies to avoid GPL code, so that the value held by GPL software goes unused in the AI era, that puts the cart before the horse. Many companies already follow a policy of keeping GPL code out of their products; if “keep GPL out of our AI training data” becomes the norm as well, GPL projects could lose their value as data sources. Moreover, the current legal battles around AI lean toward monetary compensation and regulatory rule-making, proceeding on a different vector from the code-sharing ideal of the GPL. If the GPL-propagation theory takes on a life of its own, the realistic outcome is only more data exclusion and closing-off to avoid litigation risk, with little to show for the spread of free software culture.
On the policy front, governments are weighing the use of copyrighted works in AI carefully, but none has established an explicit rule that “license violations in training data create legal liability for the model.” Even the EU AI Act, which has provisions on training data quality and transparency, does not demand compliance with open source licenses. If anything, the momentum favors allowing text and data mining under rights limitations, in the name of open science and innovation. Japan, as noted above, broadly recognizes information-analysis use under Article 30-4. Forcibly applying licenses to AI models is not the mainstream of current international policy discussions.
Taken together, the theory of license propagation to models is highly likely to disadvantage open source on both the practical and the policy fronts, and cannot be called a realistic solution. What matters is how to realize the open source philosophy of “software freedom” in the AI era; the view that this should be pursued through realistic means, ensuring transparency and promoting open model development, rather than through extreme legal interpretation, carries weight, and it is what I have consistently argued as well.
The Stance of OSI and FSF
Let me also organize the current stances of the major organizations of the open source (and free software) community toward the theory of GPL propagation to AI models. The representative organizations are the Open Source Initiative (OSI) and the Free Software Foundation (FSF); they share the goal of software freedom but do not necessarily take the same approach to AI models and training data.
First, the OSI formulated the “Open Source AI Definition” (OSAID) in 2024, defining what it takes for an AI system to be called open source. The definition states that the same four freedoms as for software (use, study, modify, redistribute) should be guaranteed for AI systems, and, to realize them, sets requirements on the “form necessary for modification,” requiring disclosure of the following three elements.
- Data Information: Provide sufficiently detailed information about the data used for training so that a skilled person can reconstruct an equivalent model.
- This does not make publishing the training data itself in its entirety mandatory, but requires disclosing the origin, scope, nature, and acquisition method if there is data that cannot be published, listing data that can be published, and providing information on data available from third parties.
- Code: Publish the complete set of source code for training and running the model under an OSI-approved license.
- Parameters: Publish the model weights (parameters) under OSI-approved conditions.
Note that while the OSI holds that information about the training code and training data is indispensable, in addition to model weights, for realizing “Open Source AI,” it does not require complete disclosure of the training data itself. This is a flexible stance: if raw data cannot be published for privacy or confidentiality reasons, for example, saying so and explaining the nature of the data can substitute. The legal mechanism for ensuring free use of model parameters is likewise left for future clarification; no conclusion has been reached on the legal status of parameters (for example, whether they are copyrightable at all).
As this suggests, the OSI promotes opening up AI models at the level of its definition in principle, while keeping the treatment of training data at the level of information disclosure. The OSI thereby avoids adopting the license-propagation theory as a lever to demand training-data disclosure, and instead explores a realistic solution that guarantees transparency and reproducibility first. In effect, one could say the OSI rejected the GPL propagation theory at the moment it published the OSAID. For the record, I was probably the one who blocked the push to make training-data disclosure mandatory in the final stage of the definition process, and I believe that was the correct judgment.
The FSF and FSF Europe (FSFE), on the other hand, take a stance more faithful to first principles. The FSFE declared as early as 2021 that for an AI application to be free, both its training code and its training data must be published under a free software license. That is, to modify or verify a model one must be able to obtain it together with its training data, so both must be free. The FSF itself stated in 2024 that, on its current understanding, for an ML application to be called free, all training data and the scripts that process it must satisfy the four freedoms, extending the requirements of freedom to data. FSF/FSFE thus hold that a model whose training data is undisclosed is unfree as a whole, even if the software part is free.
At the same time, the FSF says that whether a non-free machine learning application is ethically wrong depends on the case, noting that there can be “legitimate moral reasons” for not publishing, say, the training data (personal information) of a medical diagnosis AI. In such a case, the AI is non-free, yet its use might be ethically permissible on grounds of social utility. One can see here an attempt to reconcile the FSF’s ideals with reality, but in any case there is no doubt that the FSF ultimately aims at freedom that includes training data.
So does the FSF support the theory of GPL propagation to AI models? Not necessarily. Its claims are closer to an ethical standard or ideal than to legal enforceability; it does not argue that the current GPL, as interpreted, applies to models. Rather, as noted above, it is at the stage of trying to create new criteria and consensus. Even the FSF-funded white paper on the Copilot issue, while discussing legal points such as copyright and license violations, reads in substance largely as a GPL-compliance problem for users (downstream developers), who worry that they bear the risk of GPL violation if Copilot’s output contains GPL code fragments. That is a caution to developers using AI coding tools, not an application of the GPL to the model itself, and it differs from an approach that forces GPL compliance directly onto model providers.
The Software Freedom Conservancy (SFC) naturally takes a strong interest in this issue, but it too is cautious in places. The SFC launched the “Give Up GitHub” protest campaign against GitHub in 2022, condemning Copilot’s approach as contrary to the philosophy of open source, and it is also involved in the Copilot class action. In a blog post about the lawsuit, however, the SFC voiced concern about the risk of interpretations that deviate from the open source community’s principles being brought in, and called on the plaintiffs’ side to adhere to community-led GPL enforcement principles. The SFC calls Copilot’s conduct an “unprecedented license violation,” and while it does not fully deny the GPL propagation theory, it can be read as fearing that the legal battle could produce a precedent the community does not want. The SFC, one might say, is carefully balancing the pursuit of GPL propagation against the risk of entrusting the question to the courts.
Finally, the free software camp’s shared worry is that excessive license propagation could backfire and impair freedom itself. Both the OSI and the FSF ultimately want AI to be open and usable by anyone, but they are carefully weighing whether doctrinal purity in demanding full data disclosure actually serves that goal. Given the drawbacks, avoidance of open data under an expansive propagation reading, or a chilling effect from a flurry of lawsuits, the major organizations seem to share the conviction that the big picture of spreading freedom must not be lost. Rather than agitating for GPL application to models, the pursuit of realistic solutions, how to open up models and data and where to relax demands in line with reality, will likely continue.
Summary
I have surveyed the current state of the theory of GPL propagation to AI models; the conclusion is that the theory sits halfway: “no longer touted as loudly as before, but not completely gone.” As issues such as license violations in training data and reproduction inside the model come under scrutiny in the Copilot class action and GEMA v. OpenAI, the bar for finding infringement even appears to be dropping. Indeed, the Munich court deemed model memorization to be reproduction, and the open source license violation claim survives in the Copilot litigation.
On the other hand, the bar for license propagation in the GPL’s sense remains high. There is a large gap between a finding of infringement and the conclusion that the entire model must immediately be disclosed under the GPL. What the current lawsuits seek is injunctions and damages, not forced GPL-ization of the model. No court has yet endorsed the theory of GPL propagation to models itself; it is legally uncharted territory. Even if someone attempted the claim somewhere in the future, it would face the legal, technical, and practical counterarguments described above.
Still, the situation remains fluid in places, and the line may shift with each country’s policies and the community’s direction. If pressure from rights-holder groups strengthens in Europe, for example, guidelines that include license compliance might be drawn up. And if the community reaches consensus on what copyleft should be in the AI era, a new license might appear. If such changes occur, a phase in which the propagation theory is re-evaluated will come.
To offer a personal view, what matters now is how to balance software freedom with freedom in the AI domain. Rather than mechanically transplanting the philosophy of copyleft onto AI, we need to think about what maximizes freedom given AI’s particular technical nature and industrial structure. Fortunately, the open source community is already exploring solutions to the practical problems: open publication of large-scale models, dataset cleaning methods, automated attachment of license notices. Promoting such voluntary efforts, backed by legal frameworks where necessary, is likely the key to balancing freedom and progress.
The theory of GPL propagation to models divides opinion: an ideal to pursue, or a nightmare to avoid. As this article has shown, looking at the situation in 2025, it is not about to become reality, and most of the community appears to be keeping a cautious stance. Trial and error will no doubt continue in the courts, in legislatures, and in technology; our community needs to keep searching for the point where technological innovation and software freedom are compatible, without jumping to hasty conclusions. That process itself is the free software spirit’s new challenge in the AI era.
References
