(Comments)

Original link: https://news.ycombinator.com/item?id=43663941

A Hacker News discussion centers on whether training AI models on copyrighted material constitutes copyright infringement. The original article argues that it does not, equating training with analysis rather than copying. Commenters push back strongly, pointing out that AI can reproduce copyrighted works or generate paraphrased versions of them. They argue this amounts to illegal copying, even if the training process itself is deemed legal. Concerns are raised about the lack of legal precedent for AI-generated content and the possibility that AI could be used to circumvent copyright law. Some argue that the current legal framework around copyright is inadequate for the questions AI raises. Others compare AI training to "lossy compression" of copyrighted material, similar to how MP3 works. Another view holds that the focus should instead be on how AI models affect artists' work and how society benefits from them. One suggestion is that AI models themselves should be in the public domain, preventing companies from profiting from them.

Related Articles

Original
Why training AI can't be IP theft (giovanh.com)
13 points by OuterVale 1 hour ago | hide | past | favorite | 19 comments

I asked AI to complete an AGPL code file I wrote a decade ago. It did a pretty good job. What came out wasn't 100% identical, but clearly a paraphrased copy of my original.

Even if we accept the house of cards of shaky arguments this essay is built on, even just for the sake of argument, where OpenAI breaks my copyright is by having a computer "memorize" my work. That's a form of copy.

If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto. If I encode it in a different format (e.g. bits on magnetic media, or weights in a model), it still includes a duplicate.

On the face of it, OpenAI, Hugging Face, Anthropic, Google, and all other companies are breaking copyright law as written.

Usually, when reality and law diverge, law eventually shifts; not reality. Personally, I'm not a big fan of copyright law as written. We should have a discussion of what it should look like. That's a big discussion. I'll make a few claims:

- We no longer need to encourage technological progress; it's moving fast enough. If anything, slowing it down makes sense.

- "Fair use" is increasingly vague in an era where I can use AI to take your picture, tweak it, and reproduce an altered version in seconds.

- Transparency is increasingly important as technology defines the world around us. If the TikTok algorithm controls elections, and Google analyzes my data, it's important I know what those are.

That's the bigger discussion to have.



> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto.

Yeah, that's something that I've not seen a good answer to from the "everything AI does is legal" people. Even if the training is completely legal, how do you verify that the generated output is not illegally similar to a copyrighted work that was ingested? Humans get in legal trouble if they produce a work that's too similar. Does AI not? If AI doesn't, can I just write an AI whose job is to reproduce copyrighted content and now I have a loophole to reproduce copyrighted content?



Cleanroom implementation comes to mind.

If I just remember the source code of a 100 line program and then reproduce it verbatim a week later that doesn’t suddenly make it a new work.



What's stopping me from paraphrasing movies by peppering the least significant color bits? Would that make copying them legal?
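To make the "peppering" concrete, here is a toy sketch (plain Python, made-up pixel bytes, not from the thread) that flips the least significant bit of every byte of a frame. The output is bitwise novel, yet each value shifts by at most 1 out of 255, so no viewer could tell the copies apart:

```python
def perturb_lsbs(pixels):
    """Flip the least significant bit of every pixel byte.

    Every byte changes, so the result is not a verbatim copy,
    but each value moves by at most 1 -- visually identical.
    """
    return [p ^ 1 for p in pixels]

frame = [13, 200, 87, 64, 255, 0, 31, 128]  # toy "frame" of pixel bytes
copy = perturb_lsbs(frame)

# copy is bitwise different from frame, yet indistinguishable to a viewer:
# no byte differs by more than 1.
```

This is exactly why copyright looks at substantial similarity rather than bit-identity: a mechanically "paraphrased" file is still, for legal purposes, the same work.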


> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation.

By your logic, anyone with a good enough memory violates copyright law just by the act of remembering something.



No, because you don’t actually violate copyright law until you produce and distribute copies.

It’s perfectly legal to memorize a book, type up a copy from memory, and to even print out a copy that you keep for yourself. But as soon as you start trying to sell or distribute copies, even for free, now you’re breaking the law as written.



No - it's the reproduction, not the memorization.


I don't see that point in the original comment. Remembering copyrighted content ≠ reproducing a verbatim copy of it.


This doesn’t seem true. I mean, it might be true if memory could be seen or manipulated, but what would you bring into a court of law to prove that I remembered something too clearly?


Maybe the infringement occurs when a user uses the model to produce the facsimile output.


Good idea. Let’s make it a minefield of copyright infringement for the user so they never know whether it’s emitting something novel or it’s emitting AGPL code.


Copyright reserves most rights to the author by default. And copyright law did anticipate future changes.

Copyright laws (in the US) added fair use, which has four tests. Not all of the tests need to fail for fair use to disappear. Usually two are enough.

The one courts love the most is if the copy is used to create something commercial that competes with the original work.

From near the top of the article:

> I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is “bad.”

So essentially, the author admits that AI fails this test.

Thus, if authors can show the AI fails another test (and AI usually fails the substantive difference test), AI is copyright infringement. Period.

The fact that the article gives up that point so early makes me feel I would be wasting time reading more, but I will still do it.



"I think the unambiguous answer to this question is that the act of training is viewing and analysis, not copying. There is no particular copy of the work (or any copyrightable elements) stored in the model. While some models are capable of producing work similar to their inputs, this isn’t their intended function, and that ability is instead an effect of their general utility. Models use input work as the subject of analysis, but they only “keep” the understanding created, not the original work."

The author just seems to have decided the answer and worked backwards, when in reality this is very much a Ship of Theseus type problem. At what point does a compressed JPEG stop being the original image and become a transformation? The same thing applies here. If I ask a model to recite Frankenstein and it largely does, is that not a lossy compression of the original? Would the author argue an MP3 isn't a copy of a song because all the information isn't there?

Calling it "training" instead of compression lets the author play semantic games.
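The lossy-compression analogy above can be made concrete with a toy sketch (plain Python, invented numbers, not from the thread). Quantization discards information, so the reconstruction matches no original sample exactly, yet it tracks the original closely: a degraded but recognizable copy, which is what MP3 and JPEG produce.

```python
def lossy_compress(samples, step=10):
    """Quantize samples to multiples of `step` (information is discarded)."""
    return [round(s / step) for s in samples]

def reconstruct(quantized, step=10):
    """Rebuild an approximation from the quantized values alone."""
    return [q * step for q in quantized]

original = [3, 17, 42, 98, 156, 201]
stored = lossy_compress(original)   # this is all we "keep"
approx = reconstruct(stored)

# No sample survives exactly, but every reconstructed value is within
# step/2 of the original -- a lossy copy, not a new work.
```

Whether model weights are closer to this kind of lossy store or to genuine analysis is the crux the article asserts rather than argues.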



The assumption that human learning and “machine learning” are somehow equivalent (in a physical, ethical, or legal sense—the domain shifts throughout the essay) is not supported with evidence here. They spend a long time describing how machine learning is different from human learning on a computational level, but that doesn’t seem to impact the rest of the argument.

I wish AI proponents would use the plain meaning of words in their persuasive arguments, instead of muddying the waters with anthropomorphic metaphors that smuggle in the conclusion.



That's a lot of words to justify what I presume to be the author's pre-existing viewpoint.

Given that "training" on someone else's IP will lead to a regurgitation of some slight permutation of that IP (e.g., all the Studio Ghibli style AI images), I think the author is pushing shit up hill with the word "can't".



There are a few stages involved in delivering the output of a LLM or text-to-image model:

1. acquire training data

2. train on training data

3. run inference on trained model

4. deliver outputs of inference

One can subdivide the above however one likes.

My understanding is that most lawsuits are targeting 4. deliver outputs of inference.

This is presumably because it has the best chance of resulting in a verdict favorable to the plaintiff.

The issue of whether or not it's legal to train on training data to which one does not hold copyright is probably moot - businesses don't care too much about what you do unless you're making money off it.
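The staged breakdown above can be sketched as a toy pipeline (hypothetical function names, plain Python; stand-ins only, not how any real system works), which makes it clear that a lawsuit can target any one stage independently:

```python
def acquire_training_data():
    # 1. collect source material (the stage the "is training legal?" debate targets)
    return ["text A", "text B"]

def train(corpus):
    # 2. fit model parameters to the corpus (stand-in for real training)
    return {"weights": len(corpus)}

def infer(model, prompt):
    # 3. run inference on the trained model
    return f"completion for {prompt!r} ({model['weights']} docs seen)"

def deliver(output):
    # 4. ship the output to a user (the stage most lawsuits target)
    return output

result = deliver(infer(train(acquire_training_data()), "hello"))
```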



“If humans were somehow required to have an explicit license to learn from work, it would be the end of individual creativity as we know it“

What about textbooks? In order to train on a textbook, I have to pay a licensing fee.



The argument in the article breaks down by taking a marketing term at face value and trying to apply it to a technical argument.

You might as well start by saying that the "cloud" means some computers really do float in the sky. Does AWS rain?

This "AI", or rather this program, is not "training" or "learning" - at least not in the way the humans who conceived these laws anticipated or intended. It doesn't fit the usual dictionary meaning of training or learning. If it did, we'd have real AI, i.e. what the current term AGI describes.



I agree you can't just say it's learning and be done with it, but I think there is a discussion to be had about what training a model is.

When they made the MP3 format, for example, they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data. Is that a copyright violation? I think the answer is obviously no, so there is a way to use copyrighted material to produce something new based on it, that isn't reproduction.

The obvious answer is that MP3 doesn't replace the music itself commercially, it doesn't damage the market, while the things produced by an AI model can, but by that logic, is it a copyright violation for an instrument manufacturer to go and use a bunch of music to tailor an instrument to be better, if that instrument could be used to create music that competes with it? Again, no, but clearly there is a difference in how much that instrument would have drawn from the works. AI Models have the potential to spit out very similar works which makes them much more harmful to the original works' value.

I think looking at it through the lens of copyright just isn't useful: it's not exactly the same thing, and the rules around copyright aren't good for managing it. Rather, we should be asking what we want from models and what they provide to society. As I see it, we should be asking how we can address the artists having their work fed into something that may reduce the value of that work; it's clearly a problem, and I don't think pushing the onus onto the person using the model not to create anything that infringes is a strategy that will actually work.

I think a reasonable route is that models shouldn't be copyrightable/patentable themselves, companies should not be allowed to rent-seek on something largely based on other people's work, they should be inherently in the public domain like recipes. Of course, legislating something like that is hard at the best of times, and the current environment is hostile to passing anything, let alone something pro-consumer.






