(Comments)

Original link: https://news.ycombinator.com/item?id=43663941

A Hacker News discussion centers on whether training AI models on copyrighted material constitutes copyright infringement. The original article argues that it does not, equating training with analysis rather than copying. Commenters push back strongly, pointing out that AI can reproduce copyrighted works or generate paraphrased versions of them. They argue this amounts to illegal copying, even if the training process itself is deemed legal. Concerns are raised about the lack of legal precedent for AI-generated content and the possibility that AI could be used to circumvent copyright law. Some argue that the current legal framework around copyright is inadequate for the questions AI raises. Others compare AI training to "lossy compression" of copyrighted material, similar to how MP3 works. Another view holds that the focus should instead be on how AI models affect artists' work and how society benefits from them. One suggestion is that AI models themselves should be in the public domain, preventing companies from profiting from them.

Related Articles

Original
Why training AI can't be IP theft (giovanh.com)
13 points by OuterVale 1 hour ago | hide | past | favorite | 19 comments

I asked AI to complete an AGPL code file I wrote a decade ago. It did a pretty good job. What came out wasn't 100% identical, but clearly a paraphrased copy of my original.

Even if we accept the house of cards of shaky arguments this essay is built on, even just for the sake of argument, where OpenAI breaks my copyright is by having a computer "memorize" my work. That's a form of copy.

If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto. If I encode it in a different format (e.g. bits on magnetic media, or weights in a model), it still includes a duplicate.

On the face of it, OpenAI, Hugging Face, Anthropic, Google, and all other companies are breaking copyright law as written.

Usually, when reality and law diverge, law eventually shifts; not reality. Personally, I'm not a big fan of copyright law as written. We should have a discussion of what it should look like. That's a big discussion. I'll make a few claims:

- We no longer need to encourage technological progress; it's moving fast enough. If anything, slowing it down makes sense.

- "Fair use" is increasingly vague in an era where I can use AI to take your picture, tweak it, and reproduce an altered version in seconds.

- Transparency is increasingly important as technology defines the world around us. If the TikTok algorithm controls elections, and Google analyzes my data, it's important I know what those are.

That's the bigger discussion to have.



> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto.

Yeah, that's something that I've not seen a good answer to from the "everything AI does is legal" people. Even if the training is completely legal, how do you verify that the generated output is not illegally similar to a copyrighted work that was ingested? Humans get in legal trouble if they produce a work that's too similar. Does AI not? If AI doesn't, can I just write an AI whose job is to reproduce copyrighted content and now I have a loophole to reproduce copyrighted content?



Cleanroom implementation comes to mind.

If I just remember the source code of a 100 line program and then reproduce it verbatim a week later that doesn’t suddenly make it a new work.



What's stopping me from paraphrasing movies by peppering the least significant color bits? Would that make copying them legal?
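To make the "peppering" concrete, here is a toy sketch (plain Python, made-up pixel bytes, not from the thread) that flips the least significant bit of every byte of a frame. The output is bitwise novel, yet each value shifts by at most 1 out of 255, so no viewer could tell the copies apart:

```python
def perturb_lsbs(pixels):
    """Flip the least significant bit of every pixel byte.

    Every byte changes, so the result is not a verbatim copy,
    but each value moves by at most 1 -- visually identical.
    """
    return [p ^ 1 for p in pixels]

frame = [13, 200, 87, 64, 255, 0, 31, 128]  # toy "frame" of pixel bytes
copy = perturb_lsbs(frame)

# copy is bitwise different from frame, yet indistinguishable to a viewer:
# no byte differs by more than 1.
```

This is exactly why copyright looks at substantial similarity rather than bit-identity: a mechanically "paraphrased" file is still, for legal purposes, the same work.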


> If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation.

By your logic, anyone with a good enough memory violates copyright law just by the act of remembering something.



No, because you don’t actually violate copyright law until you produce and distribute copies.

It’s perfectly legal to memorize a book, type up a copy from memory, and to even print out a copy that you keep for yourself. But as soon as you start trying to sell or distribute copies, even for free, now you’re breaking the law as written.



No - it's the reproduction, not the memorization.


I don't see that point in the original comment. Remembering copyrighted content ≠ reproducing a verbatim copy of it.


This doesn’t seem true. I mean, it might be true if memory could be seen or manipulated, but what would you bring into a court of law to prove that I remembered something too clearly?


Maybe the infringement occurs when a user uses the model to produce the facsimile output.


Good idea. Let’s make it a minefield of copyright infringement for the user so they never know whether it’s emitting something novel or it’s emitting AGPL code.


Copyright reserves most rights to the author by default. And copyright law did anticipate future changes.

Copyright laws (in the US) added fair use, which has four tests. Not all of the tests need to fail for fair use to disappear. Usually two are enough.

The one courts love the most is if the copy is used to create something commercial that competes with the original work.

From near the top of the article:

> I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is “bad.”

So essentially, the author admits that AI fails this test.

Thus, if authors can show the AI fails another test (and AI usually fails the substantive difference test), AI is copyright infringement. Period.

The fact that the article gives up that point so early makes me feel I would be wasting time reading more, but I will still do it.



"I think the unambiguous answer to this question is that the act of training is viewing and analysis, not copying. There is no particular copy of the work (or any copyrightable elements) stored in the model. While some models are capable of producing work similar to their inputs, this isn’t their intended function, and that ability is instead an effect of their general utility. Models use input work as the subject of analysis, but they only “keep” the understanding created, not the original work."

The author just seems to have decided the answer and worked backwards, when in reality this is very much a Ship of Theseus type problem. At what point does a compressed JPEG stop being the original image and become a transformation? The same thing applies here. If I ask a model to recite Frankenstein and it largely does, is that not a lossy compression of the original? Would the author argue an MP3 isn't a copy of a song because all the information isn't there?

Calling it "training" instead of compression lets the author play semantic games.
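The lossy-compression analogy above can be made concrete with a toy sketch (plain Python, invented numbers, not from the thread). Quantization discards information, so the reconstruction matches no original sample exactly, yet it tracks the original closely: a degraded but recognizable copy, which is what MP3 and JPEG produce.

```python
def lossy_compress(samples, step=10):
    """Quantize samples to multiples of `step` (information is discarded)."""
    return [round(s / step) for s in samples]

def reconstruct(quantized, step=10):
    """Rebuild an approximation from the quantized values alone."""
    return [q * step for q in quantized]

original = [3, 17, 42, 98, 156, 201]
stored = lossy_compress(original)   # this is all we "keep"
approx = reconstruct(stored)

# No sample survives exactly, but every reconstructed value is within
# step/2 of the original -- a lossy copy, not a new work.
```

Whether model weights are closer to this kind of lossy store or to genuine analysis is the crux the article asserts rather than argues.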



The assumption that human learning and “machine learning” are somehow equivalent (in a physical, ethical, or legal sense—the domain shifts throughout the essay) is not supported with evidence here. They spend a long time describing how machine learning is different from human learning on a computational level, but that doesn’t seem to impact the rest of the argument.

I wish AI proponents would use the plain meaning of words in their persuasive arguments, instead of muddying the waters with anthropomorphic metaphors that smuggle in the conclusion.



That's a lot of words to justify what I presume to be the author's pre-existing viewpoint.

Given that "training" on someone else's IP will lead to a regurgitation of some slight permutation of that IP (e.g., all the Studio Ghibli style AI images), I think the author is pushing shit up hill with the word "can't".



There are a few stages involved in delivering the output of a LLM or text-to-image model:

1. acquire training data

2. train on training data

3. run inference on trained model

4. deliver outputs of inference

One can subdivide the above however one likes.

My understanding is that most lawsuits are targeting 4. deliver outputs of inference.

This is presumably because it has the best chance of resulting in a verdict favorable to the plaintiff.

The issue of whether or not it's legal to train on training data to which one does not hold copyright is probably moot - businesses don't care too much about what you do unless you're making money off it.
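The staged breakdown above can be sketched as a toy pipeline (hypothetical function names, plain Python; stand-ins only, not how any real system works), which makes it clear that a lawsuit can target any one stage independently:

```python
def acquire_training_data():
    # 1. collect source material (the stage the "is training legal?" debate targets)
    return ["text A", "text B"]

def train(corpus):
    # 2. fit model parameters to the corpus (stand-in for real training)
    return {"weights": len(corpus)}

def infer(model, prompt):
    # 3. run inference on the trained model
    return f"completion for {prompt!r} ({model['weights']} docs seen)"

def deliver(output):
    # 4. ship the output to a user (the stage most lawsuits target)
    return output

result = deliver(infer(train(acquire_training_data()), "hello"))
```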



“If humans were somehow required to have an explicit license to learn from work, it would be the end of individual creativity as we know it“

What about textbooks? In order to train on a textbook, I have to pay a licensing fee.



The argument in the article breaks down by taking a marketing term at face value and trying to apply it to a technical argument.

You might as well start by saying that the "cloud" means some computers really do float in the sky. Does AWS rain?

This "AI", or rather this program, is not "training" or "learning" - at least not in the way the humans who conceived these laws anticipated or intended. It doesn't fit the usual dictionary meaning of training or learning. If it did, we'd have real AI, i.e. what the current term AGI describes.



I agree you can't just say it's learning and be done with it, but I think there is a discussion to be had about what training a model is.

When they made the MP3 format, for example, they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data. Is that a copyright violation? I think the answer is obviously no, so there is a way to use copyrighted material to produce something new based on it, that isn't reproduction.

The obvious answer is that MP3 doesn't replace the music itself commercially, it doesn't damage the market, while the things produced by an AI model can, but by that logic, is it a copyright violation for an instrument manufacturer to go and use a bunch of music to tailor an instrument to be better, if that instrument could be used to create music that competes with it? Again, no, but clearly there is a difference in how much that instrument would have drawn from the works. AI Models have the potential to spit out very similar works which makes them much more harmful to the original works' value.

I think looking at it through the lens of copyright just isn't useful: it's not exactly the same thing, and the rules around copyright aren't good for managing it. Rather, we should be asking what we want from models and what they provide to society. As I see it, we should be asking how we can address the artists having their work fed into something that may reduce the value of that work; it's clearly a problem, and I don't think pushing the onus onto the person using the model not to create anything that infringes is a strategy that will actually work.

I think a reasonable route is that models shouldn't be copyrightable/patentable themselves, companies should not be allowed to rent-seek on something largely based on other people's work, they should be inherently in the public domain like recipes. Of course, legislating something like that is hard at the best of times, and the current environment is hostile to passing anything, let alone something pro-consumer.






