With AI initiatives developing at a rapid pace, copyright holders are on high alert. In addition to legislation, several currently ongoing lawsuits will help to define what's allowed and what isn't. Responding to a lawsuit from several authors, Meta now admits that it used portions of the Books3 dataset to train its Llama models. This dataset includes many pirated books.
In recent months, rightsholders of all stripes have filed lawsuits against companies that develop AI models.
The list includes record labels, individual authors, visual artists, and more recently the New York Times. These rightsholders all object to the presumed use of their work without proper compensation.
Several of the lawsuits filed by book authors include a piracy component as well. The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models.
The Books3 dataset has a clear piracy angle. It was created in 2020 by AI researcher Shawn Presser, who scraped the library of ‘pirate’ site Bibliotik. At the time, this book archive was publicly hosted by digital archiving collective ‘The Eye’, alongside various other data sources.
The idea was that the plaintext collection of more than 195,000 books, nearly 37GB in size, could help AI enthusiasts build better models and thereby spur innovation.
AI Boom Triggers Copyright Troubles
Presser wasn’t wrong, but the dataset didn’t just help garage AI startups. Several of the world’s largest tech companies discovered it too and used it to improve their own language models.
For years, Books3 continued to be freely and widely available, aiding AI researchers and enthusiasts around the world. However, when the AI boom reached the mainstream last year, book authors and publishers took notice, then took retaliatory action.
For example, Danish anti-piracy group Rights Alliance demanded that The Eye remove its copy of Books3, which it did. The dataset also disappeared from the website of AI company Hugging Face, which cited reported copyright infringement, while others considered their options.
As previously reported by Wired, Bloomberg informed Rights Alliance that it doesn’t plan to train future versions of its BloombergGPT model using Books3, and other companies likely made similar decisions behind closed doors.
Meta Admits Books3 Use
These are noteworthy developments but not all complaints can be resolved with promises. Several lawsuits against OpenAI and Meta remain ongoing, accusing the companies of using the Books3 dataset to train their models.
While OpenAI and Meta are very cautious about discussing the subject in public, Meta provided more context in a California federal court this week.
Responding to a lawsuit from writer/comedian Sarah Silverman, author Richard Kadrey, and other rights holders, the tech giant admits that “portions of Books3” were used to train the Llama AI model before its public release.
“Meta admits that it used portions of the Books3 dataset, among many other materials, to train Llama 1 and Llama 2,” Meta writes in its answer.
These legal battles are still in their early stages, but they may ultimately find their way to the Supreme Court. AI companies have stressed that progress will be hampered if rules and regulations are too strict.
Earlier this week, OpenAI mentioned that fair use is both necessary and critical to building competitive AI models, noting that news organizations can opt out if they wish. Needless to say, this option didn't previously exist, certainly not for the Books3 dataset.
We presume that when Presser created Books3, he never envisioned that the dataset would end up at the center of landmark lawsuits that could define the future of AI. However, the stakes have changed, and the well-intended ‘archiving’ effort is now part of a major copyright clash.