Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet—that is, it requires the kind found in books. In a lawsuit filed in California last month, the writers Sarah Silverman, Richard Kadrey, and Christopher Golden allege that Meta violated copyright laws by using their books to train LLaMA, a large language model similar to OpenAI’s GPT-4—an algorithm that can generate text by mimicking the word patterns it finds in sample texts. But neither the lawsuit itself nor the commentary surrounding it has offered a look under the hood: We have not previously known for certain whether LLaMA was trained on Silverman’s, Kadrey’s, or Golden’s books, or any others, for that matter.
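To make “mimicking the word patterns it finds in sample texts” concrete, here is a toy sketch, far simpler than LLaMA or GPT-4: a bigram model that records which words follow which in a sample text, then generates new text by sampling from those observed continuations. The sample string is invented for illustration.

```python
# A toy illustration (nothing like a real large language model in scale):
# learn which words follow each word in the sample, then generate text by
# repeatedly sampling one of the observed continuations.
import random
from collections import defaultdict

sample = "the cat sat on the mat and the dog slept on the mat"
words = sample.split()

followers = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    followers[current].append(nxt)

word = random.choice(words)
output = [word]
for _ in range(12):
    options = followers.get(word)
    if not options:
        break  # reached a word with no observed continuation
    word = random.choice(options)
    output.append(word)

print(" ".join(output))
```

LLaMA and GPT-4 replace this word-pair lookup with billions of learned parameters, but the underlying idea is the same: the output echoes the statistical patterns of the text the model was trained on, which is why the training text matters so much.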

In fact, it was. I recently obtained and analyzed a dataset used by Meta to train LLaMA. Its contents more than justify a fundamental aspect of the authors’ allegations: Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.

Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. In addition to work by Silverman, Kadrey, and Golden, nonfiction by Michael Pollan, Rebecca Solnit, and Jon Krakauer is being used, as are thrillers by James Patterson and Stephen King and other fiction by George Saunders, Zadie Smith, and Junot Díaz. These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet. A Meta spokesperson declined to comment on the company’s use of Books3; a spokesperson for Bloomberg confirmed via email that Books3 was used to train the initial model of BloombergGPT and added, “We will not include the Books3 dataset among the data sources used to train future versions of BloombergGPT”; and Stella Biderman, EleutherAI’s executive director, did not dispute that the company used Books3 in GPT-J’s training data.

As a writer and computer programmer, I’ve been curious about what kinds of books are used to train generative-AI systems. Earlier this summer, I began reading online discussions among academic and hobbyist AI developers on sites such as GitHub and Hugging Face. These eventually led me to a direct download of “the Pile,” a massive cache of training text created by EleutherAI that contains the Books3 dataset, plus material from a variety of other sources: YouTube-video subtitles, documents and transcriptions from the European Parliament, English Wikipedia, emails sent and received by Enron Corporation employees before its 2001 collapse, and a lot more.
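The published Pile is distributed as zstd-compressed JSON-lines files in which, as I understand the format, each record carries its text alongside a “meta” field naming the source set it came from. A minimal sketch of how one might tally those sources follows; the shard file name is hypothetical, and this is an illustration of the general approach rather than the exact analysis I performed.

```python
# A minimal sketch of tallying the sources inside one shard of the Pile,
# assuming the published format: zstd-compressed JSON lines, each record
# with a "meta" field naming its source set (e.g., "Books3").
# The shard file name is hypothetical; requires the zstandard package.
import io
import json
from collections import Counter

import zstandard

counts = Counter()
with open("pile_shard.jsonl.zst", "rb") as fh:  # hypothetical file name
    reader = zstandard.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        counts[record["meta"]["pile_set_name"]] += 1

for source, n in counts.most_common():
    print(f"{source}: {n:,} documents")
```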

Other datasets, possibly containing similar texts, are used in secret by companies such as OpenAI. Shawn Presser, the independent developer behind Books3, has said that he created the dataset to give independent developers “OpenAI-grade training data.” Its name is a reference to the paper OpenAI published in 2020 introducing GPT-3, which mentioned two “internet-based books corpora” called Books1 and Books2. That paper is the only primary source that gives any clues about the contents of GPT-3’s training data, so it has been carefully scrutinized by the development community.

From information gleaned about the sizes of Books1 and Books2, developers speculate that Books1 is the complete output of Project Gutenberg, an online publisher of some 70,000 books whose copyrights have expired or whose licenses allow noncommercial distribution. No one knows what’s inside Books2. Some suspect it comes from collections of pirated books, such as Library Genesis, Z-Library, and Bibliotik, that circulate via the BitTorrent file-sharing network. (Books3, as Presser announced after creating it, is “all of Bibliotik.”)

Presser told me by telephone that he’s sympathetic to authors’ concerns. But the great danger he perceives is a monopoly on generative AI by wealthy corporations, giving them total control of a technology that’s reshaping our culture: He created Books3 in the hope that it would allow any developer to create generative-AI tools. “It would be better if it wasn’t necessary to have something like Books3,” he said. “But the alternative is that, without Books3, only OpenAI can do what they’re doing.”

To create the dataset, Presser downloaded a copy of Bibliotik from The-Eye.eu and updated a program written more than a decade ago by the hacktivist Aaron Swartz to convert the books from ePub format (a standard for ebooks) to plain text—a necessary change for the books to be used as training data. Although some of the titles in Books3 are missing relevant copyright-management information, the deletions appear to have been a by-product of the file conversion and the structure of the ebooks; Presser told me he did not knowingly edit the files in this way.
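For a sense of what that conversion involves, here is a minimal sketch in Python, not Swartz’s or Presser’s actual program: it extracts the text of each HTML document inside an ePub and discards everything else, including the ePub’s separate metadata files, which is one place copyright-management information lives. The file names are hypothetical.

```python
# A minimal sketch of ePub-to-plain-text conversion (not the actual
# Books3 tooling). Only the book's internal HTML documents are kept;
# images, stylesheets, and the ePub's metadata files are ignored, which
# is one way copyright-management information can silently drop out.
# Requires the third-party ebooklib and beautifulsoup4 packages.
from bs4 import BeautifulSoup
from ebooklib import ITEM_DOCUMENT, epub

book = epub.read_epub("book.epub")  # hypothetical input file

chunks = []
for item in book.get_items_of_type(ITEM_DOCUMENT):
    soup = BeautifulSoup(item.get_content(), "html.parser")
    chunks.append(soup.get_text(separator="\n"))

with open("book.txt", "w", encoding="utf-8") as out:  # hypothetical output
    out.write("\n\n".join(chunks))
```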

Many commentators have argued that training AI with copyrighted material constitutes “fair use,” the legal doctrine that permits the use of copyrighted material under certain circumstances, enabling parody, quotation, and derivative works that enrich the culture. The industry’s fair-use argument rests on two claims: that generative-AI tools do not replicate the books they’ve been trained on but instead produce new works, and that those new works do not hurt the commercial market for the originals. OpenAI made a version of this argument in response to a 2019 query from the United States Patent and Trademark Office. According to Jason Schultz, the director of the Technology Law and Policy Clinic at NYU, this argument is strong.

I asked Schultz whether the fact that books were acquired without permission might damage a claim of fair use. “If the source is unauthorized, that can be a factor,” Schultz said. But the AI companies’ intentions and knowledge matter. “If they had no idea where the books came from, then I think it’s less of a factor.” Rebecca Tushnet, a law professor at Harvard, echoed these ideas, and told me that the law was “unsettled” when it came to fair-use cases involving unauthorized material, with previous cases giving little indication of how a judge might rule in the future.

Meta’s proprietary stance with LLaMA suggests that the company thinks similarly about its own work. After the model leaked earlier this year and became available for download from independent developers who’d acquired it, Meta sent a DMCA takedown notice to at least one of those developers, claiming that “no one is authorized to exhibit, reproduce, transmit, or otherwise distribute Meta Properties without the express written permission of Meta.” Even after it had “open-sourced” LLaMA, Meta still required developers to agree to a license before using it; the same is true of a new version of the model released last month. (Neither the Pile nor Books3 is mentioned in a research paper about that new model.)

Control is more essential than ever, now that intellectual property is digital and flows from person to person as easily copied bytes. A culture of piracy has existed since the early days of the internet, and in a sense, AI developers are doing something that has come to seem natural. It is uncomfortably apt that today’s flagship technology is powered by mass theft.

Yet the culture of piracy has, until now, mostly facilitated personal use by individuals. Exploiting pirated books for profit, with the goal of building technology that replaces the very writers whose work was taken, is a different and more disturbing trend.