Chip giant NVIDIA has been one of the main financial beneficiaries of the artificial intelligence boom.
Revenue surged due to high demand for its AI-learning chips and data center services, and the end doesn’t appear to be in sight.
Besides selling the most sought-after hardware, NVIDIA is also developing its own models, including NeMo, Retro-48B, InstructRetro, and Megatron. These are trained on its own hardware with help from large text libraries, much as other tech giants train theirs.
Authors Sue NVIDIA for Copyright Infringement
Like other tech companies, NVIDIA has also seen significant legal pushback from copyright holders in response to its training methods. This includes authors, who, in various lawsuits, accused tech companies of training their models on pirated books.
In early 2024, for example, several authors sued NVIDIA over alleged copyright infringement.
Through the class action lawsuit, they claimed that the company’s AI models were trained on the Books3 dataset that included copyrighted works taken from the ‘pirate’ site Bibliotik. Since this happened without permission, the authors demanded compensation.
In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models. However, the allegations didn’t go away. On the contrary, the plaintiffs found more evidence during discovery.
‘NVIDIA Contacted Anna’s Archive’
Last Friday, the authors filed an amended complaint that significantly expands the scope of the lawsuit. In addition to adding more books, authors, and AI models, it also includes broader “shadow library” claims and allegations.
The authors, including Abdi Nazemian, now cite various internal NVIDIA emails and documents, suggesting that the company willingly downloaded millions of copyrighted books.
The new complaint alleges that “competitive pressures drove NVIDIA to piracy”, which allegedly included collaborating with the controversial Anna’s Archive library.
According to the amended complaint, a member of NVIDIA’s data strategy team reached out to Anna’s Archive to find out what the pirate library could offer the trillion-dollar company.
“Desperate for books, NVIDIA contacted Anna’s Archive—the largest and most brazen of the remaining shadow libraries—about acquiring its millions of pirated materials and ‘including Anna’s Archive in pre-training data for our LLMs’,” the complaint notes.
“Because Anna’s Archive charged tens of thousands of dollars for ‘high-speed access’ to its pirated collections […] NVIDIA sought to find out what ‘high-speed access’ to the data would look like.”
Anna’s Archive Points Out Legal ‘Concern’
According to the complaint, Anna’s Archive then warned NVIDIA that its library was illegally acquired and maintained. Because the site had previously wasted time on other AI companies, the pirate library asked NVIDIA executives if they had internal permission to move forward.
This permission was allegedly granted within a week, after which Anna’s Archive provided the chip giant with access to its pirated books.
“Within a week of contacting Anna’s Archive, and days after being warned by Anna’s Archive of the illegal nature of their collections, NVIDIA management gave ‘the green light’ to proceed with the piracy. Anna’s Archive offered NVIDIA millions of pirated copyrighted books.”
The complaint states that Anna’s Archive promised to provide NVIDIA with access to roughly 500 terabytes of data. This included millions of books that are usually only accessible through Internet Archive’s digital lending system, which itself has been targeted in court.
The complaint does not explicitly mention whether NVIDIA ended up paying Anna’s Archive for access to the data.
NVIDIA also stands accused of using other pirated sources. In addition to the previously included Books3 database, the new complaint alleges that the company downloaded books from LibGen, Sci-Hub, and Z-Library.
Direct and Vicarious Copyright Infringement
In addition to downloading and using pirated books for its own AI training, the authors allege NVIDIA distributed scripts and tools that allowed its corporate customers to automatically download “The Pile”, which contains the pirated Books3 dataset.
These allegations lead to new claims of vicarious and contributory infringement, alleging that NVIDIA generated revenue from customers by facilitating access to these pirated datasets.
Based on these and other claims, the authors request to be compensated for the damages they suffered. This applies to the named authors, but also to potentially hundreds of others who may later join the class action lawsuit.
As far as we know, this is the first time that correspondence between a major U.S. tech company and Anna’s Archive has been revealed in public. This will only further raise the profile of the pirate library, which just lost several domain names.
—
A copy of the first consolidated and amended complaint, filed at the U.S. District Court for the Northern District of California, is available here (pdf). The named authors include Abdi Nazemian, Brian Keene, Stewart O’Nan, Andre Dubus III, and Susan Orlean.