Nvidia contacted Anna's Archive to access books

Original link: https://torrentfreak.com/nvidia-contacted-annas-archive-to-secure-access-to-millions-of-pirated-books/

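## NVIDIA faces copyright lawsuit over AI training data

NVIDIA is a major player in the artificial intelligence boom, thanks to high demand for its chips. The company is currently caught up in a class action lawsuit brought by several authors, who accuse it of large-scale copyright infringement. The suit claims that NVIDIA knew its AI models, including NeMo and Megatron, were trained on illegally obtained copyrighted books.

The lawsuit initially focused on the pirated "Books3" dataset. It was later amended with additional evidence alleging that NVIDIA actively sought access to millions of pirated books from "shadow libraries", most notably Anna's Archive. According to the complaint, internal emails show that NVIDIA staff contacted Anna's Archive and, despite being warned that the library's contents were illegal, quickly received approval to proceed with acquiring the data. The complaint does not say whether NVIDIA ultimately paid for that access.

The plaintiff authors claim that NVIDIA not only used pirated material but also distributed tools that let customers access infringing datasets. They are seeking damages, arguing that NVIDIA put competitive advantage ahead of copyright law. The case is reportedly the first public disclosure of direct correspondence between a major U.S. tech company and Anna's Archive, and it raises significant legal and ethical questions about AI training practices.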

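## NVIDIA and copyright: discussion summary

NVIDIA is accused of using millions of pirated books from Anna's Archive to train its AI models. An expanded class action alleges that NVIDIA contacted the shadow library directly to seek high-speed access to its data, specifically for projects such as "NextLargeLLM".

NVIDIA defends its conduct as "fair use", arguing that to an AI model books are merely statistical correlations, similar to the way humans learn and remember. This sparked debate, with commenters questioning whether that logic holds up under copyright law; some sarcastically applied it to other scenarios, such as watching movies.

Many raised concerns about the legality of how the books were obtained, noting that the problem would remain even if NVIDIA had bought single copies and scraped them. The broader worry is that large AI companies are exploiting copyrighted material with too little regard for authors, helped by the lack of a detailed legal framework for AI training data. The discussion points to a possible future in which AI use of data is permitted while human access is not, favoring companies over creators.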
Original text

Chip giant NVIDIA has been one of the main financial beneficiaries in the artificial intelligence boom.

Revenue surged due to high demand for its AI-learning chips and data center services, and the end doesn’t appear to be in sight.

Besides selling the most sought-after hardware, NVIDIA is also developing its own models, including NeMo, Retro-48B, InstructRetro, and Megatron. These are trained using their own hardware and with help from large text libraries, much like other tech giants do.

Authors Sue NVIDIA for Copyright Infringement

Like other tech companies, NVIDIA has also seen significant legal pushback from copyright holders in response to its training methods. This includes authors, who, in various lawsuits, accused tech companies of training their models on pirated books.

In early 2024, for example, several authors sued NVIDIA over alleged copyright infringement.

Through the class action lawsuit, they claimed that the company’s AI models were trained on the Books3 dataset that included copyrighted works taken from the ‘pirate’ site Bibliotik. Since this happened without permission, the authors demanded compensation.

In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models. However, the allegations didn’t go away. On the contrary, the plaintiffs found more evidence during discovery.

‘NVIDIA Contacted Anna’s Archive’

Last Friday, the authors filed an amended complaint that significantly expands the scope of the lawsuit. In addition to adding more books, authors, and AI models, it also includes broader “shadow library” claims and allegations.

The authors, including Abdi Nazemian, now cite various internal NVIDIA emails and documents, suggesting that the company willingly downloaded millions of copyrighted books.

The new complaint alleges that “competitive pressures drove NVIDIA to piracy”, which allegedly included collaborating with the controversial Anna’s Archive library.

[Image: Competitive pressures]

According to the amended complaint, a member of NVIDIA’s data strategy team reached out to Anna’s Archive to find out what the pirate library could offer the trillion-dollar company.

“Desperate for books, NVIDIA contacted Anna’s Archive—the largest and most brazen of the remaining shadow libraries—about acquiring its millions of pirated materials and ‘including Anna’s Archive in pre-training data for our LLMs’,” the complaint notes.

“Because Anna’s Archive charged tens of thousands of dollars for ‘high-speed access’ to its pirated collections […] NVIDIA sought to find out what ‘high-speed access’ to the data would look like.”

[Image: what data?]

Anna’s Archive Points Out Legal ‘Concern’

According to the complaint, Anna’s Archive then warned NVIDIA that its library was illegally acquired and maintained. Because the site had previously wasted time on other AI companies, the pirate library asked NVIDIA executives whether they had internal permission to move forward.

This permission was allegedly granted within a week, after which Anna’s Archive provided the chip giant with access to its pirated books.

“Within a week of contacting Anna’s Archive, and days after being warned by Anna’s Archive of the illegal nature of their collections, NVIDIA management gave ‘the green light’ to proceed with the piracy. Anna’s Archive offered NVIDIA millions of pirated copyrighted books.”

[Image: green light]

The complaint states that Anna’s Archive promised to provide NVIDIA with access to roughly 500 terabytes of data. This included millions of books that are usually only accessible through Internet Archive’s digital lending system, which itself has been targeted in court.

The complaint does not explicitly mention whether NVIDIA ended up paying Anna’s Archive for access to the data.

NVIDIA also stands accused of using other pirated sources. In addition to the previously included Books3 dataset, the new complaint alleges that the company downloaded books from LibGen, Sci-Hub, and Z-Library.

Direct and Vicarious Copyright Infringement

In addition to downloading and using pirated books for its own AI training, the authors allege NVIDIA distributed scripts and tools that allowed its corporate customers to automatically download “The Pile”, which contains the Books3 pirated dataset.

These allegations lead to new claims of vicarious and contributory infringement, alleging that NVIDIA generated revenue from customers by facilitating access to these pirated datasets.

Based on these and other claims, the authors request to be compensated for the damages they suffered. This applies to the named authors, but also to potentially hundreds of others who may later join the class action lawsuit.

As far as we know, this is the first time that correspondence between a major U.S. tech company and Anna’s Archive has been revealed in public. This will only further raise the profile of the pirate library, which just lost several domain names.

A copy of the first consolidated and amended complaint, filed at the U.S. District Court for the Northern District of California, is available here (pdf). The named authors include Abdi Nazemian, Brian Keene, Stewart O’Nan, Andre Dubus III, and Susan Orlean.
