
Meta's Llama 3.1 model 'memorised' 42 per cent of Harry Potter book, new study finds
The study was published by computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. It evaluated five popular open-weight models to determine which of them were most likely to reproduce text from Books3, an AI training dataset made up of collections of books that are protected by copyright.
Meta's 70-billion-parameter large language model (LLM) has memorised more than 42 per cent of Harry Potter and the Philosopher's Stone well enough to reproduce 50-token excerpts from the book at least half of the time, as per the study. It also found that darker lines of the book were easier to reproduce for the LLM.
The new research comes at a time when AI companies, including Meta, are facing a wave of lawsuits accusing them of violating the law by using copyrighted material to train their models without permission.
The study offers new insights into the pivotal question of how easily AI models can reproduce excerpts from copyrighted material verbatim. Companies such as OpenAI have previously argued that memorisation of text by AI models is a fringe phenomenon. The findings of the study appear to suggest otherwise.
'There are really striking differences among models in terms of how much verbatim text they have memorized,' James Grimmelmann, one of the co-authors of the paper, was quoted as saying by Ars Technica.
'It's clear that you can in fact extract substantial parts of Harry Potter and various other books from the model. That suggests to me that probably for some of those books, there's something the law would call a copy of part of the book in the model itself,' said Mark Lemley, another co-author of the paper.
'The fair use analysis you've gotta do is not just "is the training set fair use," but "is the incorporation in the model fair use?" That complicates the defendants' story,' he added.
As part of the study, the researchers divided 36 books into passages of 100 tokens each. They used the first 50 tokens of each passage as a prompt and calculated the probability that the next 50 tokens generated by the model would match the original passage.
The study defines 'memorised' as a greater than 50 per cent chance that an AI model will reproduce the original text word-for-word. The scope of the research was limited to open-weight models because, with these, the researchers had access to technical information such as token probability values, which allowed them to calculate the probabilities for sequences of tokens more efficiently.
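The calculation described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the function names are hypothetical, and the per-token probabilities are made-up inputs rather than values from any real model. It assumes only that the probability of reproducing a 50-token continuation is the product of the probabilities the model assigns to each successive token, and it applies the study's 50 per cent threshold.

```python
import math
from typing import Sequence

THRESHOLD = 0.5  # the study's cut-off: more than a 50 per cent chance


def sequence_probability(token_logprobs: Sequence[float]) -> float:
    """Probability of the whole continuation.

    Each entry is the log-probability the model assigns to the next
    original token, given the prompt and the tokens before it. Summing
    logs is equivalent to multiplying the probabilities, but avoids
    numerical underflow on long sequences.
    """
    return math.exp(sum(token_logprobs))


def is_memorised(token_logprobs: Sequence[float],
                 threshold: float = THRESHOLD) -> bool:
    """True if the continuation would be reproduced verbatim more often
    than the threshold."""
    return sequence_probability(token_logprobs) > threshold


def memorised_fraction(passages: Sequence[Sequence[float]]) -> float:
    """Share of passages (each a list of per-token log-probabilities)
    that count as memorised."""
    if not passages:
        return 0.0
    return sum(is_memorised(p) for p in passages) / len(passages)


# Illustrative (made-up) inputs: 50 tokens, each predicted with the
# same probability.
confident = [math.log(0.99)] * 50   # 0.99 ** 50, roughly 0.605
less_sure = [math.log(0.98)] * 50   # 0.98 ** 50, roughly 0.364

print(is_memorised(confident))                       # True
print(is_memorised(less_sure))                       # False
print(memorised_fraction([confident, less_sure]))    # 0.5
```

The example shows how demanding the threshold is: for a 50-token continuation to clear 50 per cent, the model has to assign each token an average probability of about 98.6 per cent.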
This would be more difficult to do in the case of closed models like those developed by OpenAI, Google, and Anthropic.
The study found that Llama 3.1 70B memorised more than any of the other models tested, including Meta's earlier Llama 1 65B and models from Microsoft and EleutherAI. In contrast to Llama 3.1, Llama 1 was found to have memorised only 4.4 per cent of Harry Potter and the Philosopher's Stone.
Llama 3.1 was more likely to reproduce popular books such as The Hobbit and George Orwell's 1984 than obscure ones like Sandman Slim, a 2009 novel by author Richard Kadrey, as per the study. This variation between books could undermine efforts by plaintiffs to file a unified lawsuit and make it harder for individual authors to take legal action against AI companies on their own.
While the research findings could serve as evidence that several portions of the Harry Potter book were copied into the training data and weights used to develop Llama 3.1, they do not provide information on how exactly this was done.
At the start of the year, legal documents showed that Meta CEO Mark Zuckerberg had personally cleared the use of a dataset comprising pirated e-books and articles for AI training. The new study lines up with these filings, which further indicate that Meta reportedly cut corners in gathering data for AI training.