AI Didn't Break Copyright Law, It Just Exposed How Broken It Was

Original link: https://www.jasonwillems.com/technology/2026/02/02/AI-Copyright/

## AI, Copyright, and a System on the Brink

For decades, copyright law has rested on unwritten assumptions: creation is slow, distribution is expensive, and enforcement is discretionary. Fan art, for example, exists in a tolerated gray area: personal use is fine, selling is not. Generative AI has upset that balance, turning ambiguity into enormous legal and economic stakes.

The core problem is not *new* copyright infringement but *scale*. Trying to enforce copyright at the "training" stage (preventing AI from learning from copyrighted material) is impractical: the internet is saturated with legally posted copyrighted content, and untangling it is impossible. Regulating "generation" is equally flawed, because intent cannot be determined and penalties become absurdly disproportionate.

Effective enforcement ultimately hinges on *distribution*, where the harm actually occurs, mirroring existing online copyright practice. Yet even that risks over-regulation and stifled creativity. Moreover, global AI development means U.S. regulation may prove ineffective and could push innovation elsewhere.

Ultimately, existing copyright law, built for a world of scarce content, struggles to cope with AI's ability to create fluid, personalized experiences. The debate is not just about fixing existing rules but about recognizing that the nature of content itself is changing, making traditional copyright concepts increasingly obsolete. We are trying to regulate a world that is disappearing, while the future demands new questions.

## Hacker News Discussion: AI, Copyright, and a Broken System

An article titled "AI Didn't Break Copyright Law, It Just Exposed How Broken It Was" sparked a Hacker News discussion centered on the hypocrisy of copyright enforcement. Many pointed out that past criticism of copyright came mainly from people who saw it as a tool for large corporations to stifle competition, whereas today's anger stems from corporations *ignoring* copyright to the detriment of artists and smaller companies.

The core argument is that the problem is not necessarily the *principle* of copyright but the way it is enforced and applied, which has consistently favored powerful entities. Some commenters advocated major reform, suggesting shorter copyright terms (for example, 5 to 20 years, with tiered usage rights and royalties) to better support creators while encouraging innovation.

Others highlighted a historical inconsistency: the tech industry, which criticized copyright in the past, now defends infringement when *it* is the one profiting.

The discussion also touched on the point that laws are generally designed for human-scale scenarios and struggle with the complexity of AI and large-scale data use. Ultimately, many felt the current situation reveals a fundamental flaw: the legal system often serves the interests of capital rather than those of individuals and creators.

## Original Article

If you paint a picture of Sonic the Hedgehog in your living room, you are technically creating an unauthorized derivative work—but in practice, no one cares. Private, noncommercial creation has always lived in a space where copyright law exists on paper but is rarely enforced.

Gift it to a friend? Still functionally tolerated—a technical act of distribution that copyright law mostly ignores at human scale. Take a photo and post it on Instagram? Now you’ve crossed into public distribution of a derivative work without permission. Under the letter of the law, that’s infringement, although a fair-use defense might apply and Sega almost certainly won’t care. It’s fan engagement, free marketing, and good PR.

Sell that painting, though, and the tolerance disappears. You’re no longer a fan, you’re a competitor.

This gap between the written law and its practical application isn’t an AI problem. It’s something that has been quietly tolerated for decades. The system works not because the rules are clear, but because enforcement has been selectively targeted for impact and the scale has been human. With generative AI, that ambiguity suddenly represents billions of dollars in litigation and potential damages, transforming what was once informal tolerance into an existential legal battle that forces long-implicit assumptions into the open.

Copyright law has always relied on a set of quiet assumptions that were never written down but broadly understood: creation and transformation are slow, distribution is costly, and enforcement can be discretionary. Most infringement has been tolerated not because it was legal, but because it was rare, small, and culturally useful.

Fan art lives in this space. So do fan films. Star Wars is a canonical example, not because Lucasfilm formally granted rights, but because informal norms emerged: no monetization, no confusion with official releases, and the understanding that rights holders could assert control at any moment. This worked because these projects were expensive to make, clearly unofficial, and very few in number.

AI didn’t cherry-pick gray areas in copyright law; it removed the human-scale constraints that made those gray areas manageable. When AI can generate feature-length content at ultra-high volume, rights holders will inevitably argue that mere creation—not just distribution—must be prohibited.

But is there an easy solution?

The Training Layer: Why You Can’t Solve It There

The most common proposal is simple: ban training on copyrighted content. Keep AI “clean” from the start.

This sounds reasonable until you examine what “clean” actually means.

Even if you scrupulously avoid pirated material, the internet is saturated with legally posted content about copyrighted characters. Fair-use images of Sonic appear in reviews, commentary, parody, and news articles. There are photographs of licensed Sonic merchandise. There are text descriptions in product listings, forum discussions, and blog posts.

An AI trained only on legally accessible data will still learn what Sonic looks like. It will learn the character design, the color palette, the aesthetic. You cannot train on the open internet without absorbing information about major cultural properties. That knowledge emerges from millions of non-infringing references.

It gets worse. An entire article’s worth of information can be reconstructed from fair-use snippets scattered across the web—a quote here, a summary there, facts recombined from public sources. The model may not copy any single source; it synthesizes patterns from a sea of legitimate data.

Copyright law does recognize intermediate copies as potentially infringing absent fair-use defenses, and courts have historically treated unauthorized reproductions, however temporary, as relevant. But applying that doctrine to modern model training collapses under scale. When training datasets contain billions of documents, establishing precisely what went in, which copies matter, and what causal role any individual work played becomes functionally impossible.

Even outside AI, the boundaries around “transformation” have always been murky. Artisans de Genève, for example, is an independent workshop that was sued by Rolex for modifying Rolex watches. The eventual compromise was not a clean rule about transformation, but a narrow procedural boundary imposed by settlement: Artisans de Genève could modify only watches brought in directly by customers, not watches it had purchased itself. The law avoided defining where transformation ends and infringement begins; it simply constrained scale and distribution.

Model training removes those constraints entirely.

And even when pirated content is used in training, proving it is extraordinarily difficult—and removing it without retraining is nearly impossible. The New York Times attempted to demonstrate this by prompt-engineering models to reproduce exact articles through next-token prediction (a resource-intensive effort that doesn’t scale, which OpenAI characterizes as a form of adversarial prompting). Declaring an entire model “tainted” and subject to destruction because some fraction of its inputs were improperly sourced pushes copyright into uncharted territory. Unlike trade secret law, copyright traditionally remedies harm through damages, not destruction of downstream systems.

The training layer simply cannot act as a reliable enforcement point.

But rather than disappearing entirely, that pressure shifts downward, toward generation and distribution.

The Generation Layer: Creation Without Friction

At the generation layer, AI systems turn abstract knowledge into concrete outputs. This is where large model providers implement partial copyright enforcement today, implicitly framing LLMs as something more than neutral tools. Adobe Illustrator doesn’t impose IP restrictions on what users can draw; LLMs increasingly do.

Effective enforcement at this layer requires treating intent, experimentation, parody, reference, and coincidence identically. This is not a subtle shift. Copyright law was never designed to evaluate private acts of creation at massive scale.

Two problems immediately emerge:

First, intent becomes unreadable. A model generating something “similar” to Sonic isn’t forming an artistic goal. It’s sampling from a probability space shaped by its training. Inferring copyright-relevant intent here is conceptually incoherent and operationally impossible. Enforcement degrades into a cat-and-mouse game of banned phrases and descriptions that users can always circumvent (purposefully or otherwise). There is no discrete boundary where output becomes infringing, nor any realistic way for a model to recognize all forms of infringement across global IP.
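
To make that cat-and-mouse dynamic concrete, here is a minimal sketch of a keyword-style generation filter. Every name and term in it is hypothetical, and real providers' filters are far more elaborate, but the circumvention problem is structural:

```python
# Illustrative blocklist filter (hypothetical names and terms, not any
# provider's real implementation).
BLOCKED_TERMS = {"sonic the hedgehog", "mario", "darth vader"}

def is_blocked(prompt: str) -> bool:
    """Reject prompts that literally name a blocklisted character."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

# Direct references are caught...
print(is_blocked("Draw Sonic the Hedgehog surfing"))  # True

# ...but descriptive paraphrases pass, even though the model's training
# makes the intended character obvious.
print(is_blocked("a speedy blue anthropomorphic hedgehog with red sneakers"))  # False
print(is_blocked("an Italian plumber in a red hat"))  # False
```

Expanding the blocklist to cover paraphrases only restarts the loop, because natural language offers unbounded ways to describe the same protected design.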

Second, statutory damages become absurd. Copyright penalties were calibrated for rare, explicit, human-scale violations, often with willfulness in mind. When generation is cheap and ubiquitous, applying the same framework produces penalties so disproportionate that enforcement loses legitimacy. A single automated script generating infringing content at scale could theoretically expose model providers or users to catastrophic liability entirely disconnected from actual harm.

Crucially, generation alone does not imply harm. Most generated content is never shared. Context matters.

Which brings us to the only layer where copyright law has ever really functioned.

The Distribution Layer: Where Harm Actually Manifests

Copyright law polices distribution, not thought. Harm occurs when works are shared, substituted, or monetized.

Uploading an AI-generated Sonic animation to a private folder is meaningfully different from uploading it to YouTube. Flooding a platform with AI-generated Star Wars shorts is meaningfully different from sketching them for yourself. Market substitution, audience diversion, and brand dilution occur only once content is distributed.

This is why enforcement mechanisms like takedowns, DMCA safe harbors, content-ID systems, and platform moderation, however imperfect, operate here. They’re blunt, but they align with where harm occurs.
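
As a rough illustration of why distribution-layer enforcement is blunt, consider a minimal matching sketch in the spirit of content-ID systems. The index, names, and bytes below are hypothetical, and production systems rely on robust perceptual fingerprints rather than the cryptographic hash used here:

```python
import hashlib

# Hypothetical reference index supplied by rights holders: fingerprint -> claimant.
# A real content-ID system stores perceptual fingerprints, not SHA-256 digests.
REFERENCE_INDEX = {
    hashlib.sha256(b"<bytes of an official Sonic trailer>").hexdigest(): "Sega",
}

def check_upload(upload: bytes) -> str | None:
    """Return the claiming rights holder if the upload matches, else None."""
    return REFERENCE_INDEX.get(hashlib.sha256(upload).hexdigest())

claimant = check_upload(b"<bytes of a user upload>")
if claimant:
    print(f"Block or monetize on behalf of {claimant}")
else:
    print("Publish now; handle any DMCA notice later")
```

Exact hashes are trivially defeated by re-encoding, which is why real systems match approximately, and why the safe failure mode for a platform is to over-match and take down borderline uploads.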

Trying to push enforcement downward into generation destabilizes the system. You end up policing private creation before any material harm occurs because you cannot control public scale.

The Liability Maze: Everyone Loses

At this point, the question becomes not just where enforcement happens, but who absorbs the liability when it does. Different enforcement points pull different actors into the blast radius, and none of them produce a clean outcome. We’ve already covered why enforcement at the training layer is unworkable. But how about generation or distribution?

Enforcement at Generation

If liability attaches at the moment of generation, responsibility has to fall on either model providers or users.

Model providers. Require safeguards, filters, surveillance of API usage, and comprehensive IP databases.

This overwhelmingly favors incumbents. Google, OpenAI, and Meta can afford legal teams, bespoke filtering infrastructure, and licensing deals with major rights holders. A startup can’t. An open-source project certainly can’t. Compliance becomes an existential cost, entrenching monopolies and freezing competition at exactly the wrong moment.

Users. Make generating copyrighted content illegal.

This quickly collapses into absurdity. Is typing “Mario” now a crime? What about “Italian plumber in a red hat”? What if a vague prompt unintentionally produces something infringing? Is discovering gaps in a model’s safeguards akin to hacking, or to infringement? Enforcing this would require monitoring generation at scale: a surveillance regime that is both impractical and deeply unsettling to users who share highly personal information with AI models (a concern already surfacing in NYT v. OpenAI–adjacent arguments that would require access to large volumes of OpenAI customer query data).

Generation-level enforcement either centralizes power or becomes extremely overzealous in its application. Sometimes both.

Enforcement at Distribution

Alternatively, liability can attach when content is shared, published, or monetized.

Platforms and intermediaries. Shift responsibility to hosting and distribution layers.

This mirrors how copyright already functions online. Individual users are technically liable for infringement, but platforms are shielded by safe-harbor regimes so long as they remove content when notified. In practice, this turns legal enforcement into platform policy.

Faced with massive volumes of AI-generated content, platforms would respond defensively. The safest move is over-enforcement: aggressive takedowns, broad bans on derivative works, and stricter moderation of fan art and remix culture. The result is likely to be quiet erosion of some of the internet’s most creative ecosystems.

Users still bear liability in theory, but experience it indirectly through rules that increasingly err on the side of deletion.

No Clean Cut

Every enforcement point shifts responsibility and damage elsewhere:

  • Generation favors incumbents or mandates surveillance.
  • Distribution incentivizes over-policing and cultural flattening.
  • User liability is impractical at scale.

None of these approaches resolve the tension so much as relocate it, trading one set of pathologies for another.

Foreign Models: Why US Regulation Might Not Matter

Even if the U.S. solved every problem above perfectly, it might not matter.

AI is global. If U.S. regulation becomes too restrictive, development elsewhere will outpace it. A foreign company can train models with minimal copyright constraints, host them abroad, and make them accessible to Americans. Using foreign AI services is not inherently illegal today.

This leaves bleak options regarding foreign models:

Restrict access to foreign models: Block access via ISPs, DNS providers, and hosting platforms—an approach that has not meaningfully stopped piracy.

Criminalize use: Nearly impossible to enforce without mass internet surveillance.

Restrict commercial outputs: Accept access as inevitable, but prohibit monetization without compliance. This is more enforceable, but requires traceability through watermarking or permitting systems for commercial goods. It is also predicated on the assumption that non-commercial content creation won’t damage the rights holder’s ability to monetize their work.

The likely outcome is already emerging: a two-tier system. “Safe” U.S. AI, increasingly tightly integrated with major rights holders, serves commercial use, while access to foreign and open-source models is accepted as unavoidable for everyone else.

Once the ecosystem splits this way, compliance stops being a universal requirement and becomes a choice. Large companies that need legal certainty will opt into the regulated tier. Everyone else will not. And once model weights are public, compliance becomes voluntary in a much more literal sense.

You can sue Meta. You can’t stop someone in another jurisdiction from hosting an unrestricted fork. Strict domestic regulation risks handicapping U.S. companies while failing to protect rights holders. It also places a disproportionate burden on smaller companies that can’t afford the legal and compliance overhead required to navigate these gray areas.

The Pressure Cooker: Irreconcilable Forces

All of this is playing out against a backdrop of enormous, contradictory pressures.

Publishers are desperate. Each technological wave—blogs, social media, ad blockers, paywalls—eroded their leverage. AI feels like another existential blow, spanning text, images, audio, and video.

Governments are trapped. Historically aligned with rightsholders, they now face the political cost of crippling domestic AI while global competitors surge ahead. Meanwhile, AI companies now make the GDP-growth argument once owned by the media industry.

Searching for Alternatives

Given those pressures, it’s tempting to look for alternative enforcement frameworks that can satisfy all parties. The problem is that every proposal fails somewhere fundamental.

Training-based enforcement is untraceable. Generation-based enforcement is incoherent. Distribution-based enforcement is reactive and incomplete, and has historically failed to curtail piracy. Jurisdictions diverge, and global access routes around national rules.

It’s not just enforcement breaking down, but attribution itself. Influence can’t be cleanly measured, harm can’t be localized, and responsibility can’t be assigned without collapsing into surveillance or overreach.

This naturally leads to proposals that sidestep enforcement entirely, embedding compensation directly into the AI industry itself.

Government-mediated compensation: Tax AI companies and redistribute funds to rights holders, similar to blank media levies. But how do you determine fair distribution? Should a photographer whose work appeared once in training receive the same as an artist whose style was heavily learned? At internet scale, everyone with an online presence becomes a rights holder whose work was ingested indirectly. The administrative complexity is staggering. The likely outcome is either a blunt, unfair allocation or a system that primarily benefits the largest players who can navigate it best.
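
A back-of-the-envelope sketch illustrates the allocation problem. Every figure below is an invented assumption, not an estimate of any real levy, fund, or corpus:

```python
# Hypothetical per-document pro-rata levy (all numbers are made up).
levy_pool = 1_000_000_000            # assumed annual levy collected from AI companies, in dollars
documents_in_corpus = 5_000_000_000  # assumed number of ingested works

per_document = levy_pool / documents_in_corpus
print(f"Payout per ingested work: ${per_document:.2f}")              # $0.20

# A photographer with 300 ingested images vs. a publisher with 40M pages:
print(f"Individual creator: ${300 * per_document:,.2f}")             # $60.00
print(f"Largest catalog holder: ${40_000_000 * per_document:,.0f}")  # $8,000,000
```

Under flat pro-rata allocation, the payout to any individual is negligible while the bulk of the pool flows to whoever holds the largest catalogs, which is exactly the distortion described above.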

Mandatory licensing: Treat AI training like music covers, with compulsory licenses at government-set rates. But where does licensing apply? At training time, when data ingestion is diffuse and provenance unclear? At generation time, forcing cumbersome per-output fees? Or only at commercial use, which just pushes enforcement back onto platforms? Licensing works when the unit being licensed is discrete and identifiable. In AI systems, the unit of influence is neither.

What these approaches share is an assumption that influence can be measured, attributed, and priced with reasonable precision. That assumption no longer holds.

If both traditional enforcement and alternative frameworks collapse under their own complexity, we’re left with an uncomfortable question: should AI companies bear responsibility at all for diffuse, non-attributable learning? And do existing IP norms—built around discrete works and identifiable acts of copying—still make sense in a world where content creation is increasingly statistical, generative, and cheap?

No Good Answers: The Fundamental Incompatibility

The uncomfortable truth is that copyright law works best under conditions of scarcity: creation and transformation are costly, distribution is constrained, enforcement can happen at a limited number of chokepoints, and the volume of disputes remains manageable.

While we argue about how AI interacts with these rules, there is a deeper problem: the nature of content itself is changing. We are moving from static, general-purpose works to dynamic, personalized, on-demand experiences.

Imagine the New York Times not publishing fixed articles, but operating an AI service that generates reporting tailored to your preferred length, tone, and depth. Journalists still source information, investigate facts, and shape narratives, but the “article” is assembled dynamically for each reader, potentially never existing twice in the same form.

This pattern is already familiar. Many people now prefer learning about complex topics through conversation with an LLM rather than reading a static Wikipedia page. Over time, this decoupling of interface from underlying information will extend beyond text. Fixed films may give way to interactive narratives that adapt to viewer choices and generate scenes on the fly. With heavy personalization, companies may not even retain a complete record of what content was generated for whom.

In that world, what does copyright mean? If an article is generated fresh for each reader, synthesized from thousands of sources, where do rights attach? Is the output the copyrighted work? The underlying system? Each individualized experience? If a story adapts in real time, is every interaction a new work?

These aren’t distant hypotheticals (or won’t be for long). We are at the beginning of a shift that existing IP frameworks—designed for physical books, fixed recordings, and discrete artworks—were never meant to handle. The problem isn’t that these frameworks are poorly written; it’s that they assume stable works, identifiable acts of copying, and enforceable boundaries.

That’s why every attempted fix breaks down. We try to regulate training, but learning is diffuse. We try to regulate generation, but creation is private and cheap. We try to regulate distribution, but content is fluid, personalized, and global.

The irony is that by the time we resolve today’s AI copyright debates, if we ever do, the thing we’re fighting over may already be obsolete. We’re litigating static articles while the concept of “an article” dissolves into something conversational and adaptive. We’re arguing about copies while the boundary between creation and consumption blurs into a single interactive process.

The problems we’re grappling with today are real and difficult. But they may also be the wrong problems. The AI copyright debate isn’t just exposing how strained existing law has become; it’s revealing that we’re trying to regulate a world that no longer exists while the actual future races ahead, raising questions we haven’t yet learned how to ask.

The ambiguity was always there. AI just made it impossible to keep pretending we had good answers.
