
Researchers suggest OpenAI trained AI models on paywalled O'Reilly books
OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn't license to train more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model "writes" an essay on a Greek tragedy or "draws" Ghibli-style images, it's simply pulling from its vast knowledge to approximate an answer; it isn't arriving at anything new.
While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That's likely because training on purely synthetic data comes with risks, like worsening a model's performance.
The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)
In ChatGPT, GPT-4o is the default model. O'Reilly doesn't have a licensing agreement with OpenAI, the paper says.
"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content … compared to OpenAI's earlier model GPT-3.5 Turbo," wrote the co-authors of the paper. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."
The paper used a method called DE-COP, first introduced in an academic paper in 2024, that is designed to detect copyrighted content in language models' training data. A type of "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model may have seen the text before, in its training data.
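The DE-COP test is essentially a multiple-choice quiz: show the model a verbatim passage alongside AI-generated paraphrases and ask it to pick the human-authored original; a pick rate well above chance suggests the passage was memorized during training. Below is a minimal sketch of that scoring loop; the `ask_model` callback and the toy data are hypothetical stand-ins, and the real study also corrects for a model's general skill at spotting paraphrases:

```python
import random

def decop_guess_rate(trials, ask_model):
    """Estimate how often the model picks the verbatim passage.

    Each trial is (original, [paraphrase1, paraphrase2, ...]).
    ask_model(options) returns the index the model believes is the
    human-authored original. A rate well above chance (1/len(options))
    hints that the original text was seen during training.
    """
    hits = 0
    for original, paraphrases in trials:
        options = [original] + list(paraphrases)
        random.shuffle(options)  # avoid positional bias in the quiz
        choice = ask_model(options)
        if options[choice] == original:
            hits += 1
    return hits / len(trials)

# Toy run with a mock "model" that always recognizes the original,
# simulating a fully memorized (in-training-data) passage.
trials = [("original text %d" % i, ["para a", "para b", "para c"])
          for i in range(100)]
perfect = lambda opts: next(i for i, o in enumerate(opts)
                            if o.startswith("original"))
print(decop_guess_rate(trials, perfect))  # 1.0, vs. a chance rate of 0.25
```

With one original among four options, chance is 0.25; sustained rates far above that on pre-cutoff paywalled excerpts are the kind of signal the paper treats as evidence of training-set membership.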
The co-authors of the paper — O'Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.
According to the results of the paper, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That's even after accounting for potential confounding factors, the authors said, like improvements in newer models' ability to figure out whether text was human-authored.
"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors.
It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof and that OpenAI might've collected the paywalled book excerpts from users copying and pasting them into ChatGPT.
Muddying the waters further, the co-authors didn't evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and "reasoning" models such as o3-mini and o1. It's possible that these models weren't trained on paywalled O'Reilly book data or were trained on a lesser amount than GPT-4o.
That being said, it's no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs. That's a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they'd prefer the company not use for training purposes.
Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O'Reilly paper isn't the most flattering look.
OpenAI didn't respond to a request for comment.
This article originally appeared on TechCrunch at https://techcrunch.com/2025/04/01/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books/
Related Articles
Yahoo, 2 hours ago
Klarna CEO warns AI may cause a recession as the technology comes for white-collar jobs
The CEO of payments company Klarna has warned that AI could lead to job cuts and a recession. Sebastian Siemiatkowski said he believed AI would increasingly replace white-collar jobs. Klarna previously said its AI assistant was doing the work of 700 full-time customer service agents.

The CEO of the Swedish payments company Klarna says that the rise of artificial intelligence could lead to a recession as the technology replaces white-collar jobs. Speaking on The Times Tech podcast, Sebastian Siemiatkowski said there would be "an implication for white-collar jobs," which he said "usually leads to at least a recession in the short term." "Unfortunately, I don't see how we could avoid that, with what's happening from a technology perspective," he continued.

Siemiatkowski, who has long been candid about his belief that AI will come for human jobs, added that AI had played a key role in "efficiency gains" at Klarna and that the firm's workforce had shrunk from about 5,500 to 3,000 people in the last two years as a result.

It's not the first time the exec and Klarna have made headlines along these lines. In February 2024, Klarna boasted that its OpenAI-powered AI assistant was doing the work of 700 full-time customer service agents. The company, most famous for its "buy now, pay later" service, was one of the first firms to partner with Sam Altman's company. Later that year, Siemiatkowski told Bloomberg TV that he believed AI was already capable of doing "all of the jobs" that humans do and that Klarna had enacted a hiring freeze since 2023 as it looked to slim down and focus on adopting the technology.

However, Siemiatkowski has since dialed back his all-in stance on AI, telling an audience at the firm's Stockholm headquarters in May that his AI-driven customer service cost-cutting efforts had gone too far and that Klarna was now planning to recruit, according to Bloomberg.

"From a brand perspective, a company perspective, I just think it's so critical that you are clear to your customer that there will be always a human if you want," he said.

In the interview with The Times, Siemiatkowski said he felt that many people in the tech industry, particularly CEOs, tended to "downplay the consequences of AI on jobs, white-collar jobs in particular." "I don't want to be one of them," he said. "I want to be honest, I want to be fair, and I want to tell what I see so that society can start taking preparations."

Some of the top leaders in AI, however, have been ringing the alarm lately, too. Anthropic's leadership has been particularly outspoken about the threat AI poses to the human labor market. The company's CEO, Dario Amodei, recently said that AI may eliminate 50% of entry-level white-collar jobs within the next five years. "We, as the producers of this technology, have a duty and an obligation to be honest about what is coming," Amodei said. "I don't think this is on people's radar."

Similarly, his colleague Mike Krieger, Anthropic's chief product officer, said he is hesitant to hire entry-level software engineers over more experienced ones who can also leverage AI tools. The silver lining is that AI also brings the promise of better and more fulfilling work, Krieger said. Humans, he said, should focus on "coming up with the right ideas, doing the right user interaction design, figuring out how to delegate work correctly, and then figuring out how to review things at scale — and that's probably some combination of maybe a comeback of some static analysis or maybe AI-driven analysis tools of what was actually produced."

Read the original article on Business Insider
Yahoo, 2 hours ago
Bread & Butter Gourmet Deli in Tarpon Springs closing after 30 years
The Brief: The Bread & Butter Gourmet Deli in Tarpon Springs is planning to close its doors after 30 years. It is known for its turkey, falafel and array of soups, salads, and pastries. The owners plan to sell the building to a seafood restaurateur.

TARPON SPRINGS, Fla. - A beloved Tarpon Springs deli is closing its doors after more than 30 years. The Bread & Butter Gourmet Deli on Pinellas Ave is known for its turkey, falafel and array of soups, salads, and pastries. It has created a unique fusion of Mediterranean and Middle Eastern cuisine.

The backstory: The story of the deli started in Youngstown, Ohio, where owners Theo and Nellie Abbas met. Nellie said, "His best friend happened to be my brother's future brother-in-law. He brought me to a dance, a Greek dance. He met me there and that was it."

From there, the couple moved to the Big Apple to learn the ropes of cuisine. "We had a deli in New York in Lincoln Center," Theo added. "One day I got up in the morning to go to work and it snowed. I had to clean four cars before mine. I said, 'Forget about it. I'm leaving.'"

Fast-forward to the summer of 1994, when the couple purchased an old bank in Tarpon Springs, which would become Bread & Butter. "'94 I opened it. And the night before I opened it, Governor Lawton Chiles came in here with 200 people. The next day, my line went all the way outside," he added. "I didn't think it was going to have an impact like that, so we didn't have enough food. We ran out of food."

These days, there's a similar turnout after the couple announced they are closing the deli by the end of June. Nellie said, "We've gotten flowers. We've gotten cards and last week we got bombarded, we've had so many customers."

The couple said they came to a realization. She said, "I don't want to start crying, but I got to go. We got to go." With Theo now disabled, they said it's time to slow down.

He said it stems from an accident more than 20 years ago: "I fell off the roof and I struck my head, and I had a brain stem injury. I was in a coma at Bayfront hospital for 13 days. They gave me a 1-percent chance to live." Nellie added, "We have four great-grandchildren now. Seven grandchildren, so it's time to relax."

What's next: The couple plans to sell the building to a seafood restaurateur. Theo said, "It breaks my heart that I have to leave. But all good things must come to an end." Bread & Butter Gourmet Deli is located at 1880 Pinellas Ave, Tarpon Springs, FL 34689.

The Source: Information for this story was gathered by FOX 13's Jennifer Kveglis.

CNBC, 2 hours ago
Sam Altman brings his eye-scanning identity verification startup to the UK
LONDON — World, the biometric identity verification project co-founded by OpenAI CEO Sam Altman, is set to launch in the U.K. this week. The venture, which uses a spherical eye-scanning device called the Orb, will become available in London from Thursday and is planning to roll out to several other major U.K. cities — including Manchester, Birmingham, Cardiff, Belfast, and Glasgow — in the coming months.

The project aims to authenticate the identity of humans with its Orb device and prevent the fraudulent abuse of artificial intelligence systems like deepfakes. It works by scanning a person's face and iris and then creating a unique code to verify that the individual is a human and not an AI. Once someone has created their iris code, they are gifted some of World's WLD cryptocurrency and can use an anonymous identifier called World ID to sign into various applications. It currently works with the likes of Minecraft, Reddit and Discord.

Adrian Ludwig, chief architect of Tools for Humanity, which is a core contributor to World, told CNBC on a call that the project is seeing significant demand from both enterprise users and governments as the threat of AI being used to defraud various services — from banking to online gaming — grows. "The idea is no longer just something that's theoretical. It's something that's real and affecting them every single day," he said, adding that World is now transitioning "from science project to a real network."

The venture recently opened up shop in the U.S. with six flagship retail locations in Austin, Atlanta, Los Angeles, Nashville, Miami and San Francisco. Ludwig said that looking ahead, the plan is to "increase the number of people who can be verified by an order of magnitude over the next few months."

Ever since its initial launch as "Worldcoin" in 2021, Altman's World has been plagued by concerns over how it could affect users' privacy. The startup says it addresses these concerns by encrypting the biometric data collected and ensuring the original data is deleted. On top of that, World's verification system depends on a decentralized network of users' smartphones, rather than the cloud, to carry out individual identity checks. Still, this becomes harder to do in a network with billions of users, like Facebook or TikTok.

For now, World has 13 million verified users and is planning to scale that up. Ludwig argues World is a scalable network, as all of the computation and storage is processed locally on a user's device — it's only the infrastructure for confirming someone's uniqueness that is handled by third-party providers.

Ludwig says the way technology is evolving means it's getting much easier for new AI systems to bypass currently available authentication methods, such as facial recognition and CAPTCHA bot-prevention measures. He sees World serving a pertinent need in the transition from physical to digital identity systems.

Governments are exploring digital ID schemes to move away from physical cards, but so far these attempts have been far from perfect. One example of a major digital identity system is India's Aadhaar. Although the initiative has seen widespread adoption, it has also drawn criticism for lax security and for allegedly worsening social inequality for Indians.

"We're beginning to see governments now more interested in how can we use this as a mechanism to improve our identity infrastructure," Ludwig told CNBC. "Mechanisms to identify and reduce fraud is of interest to governments."

The technologist added that World has been talking to various regulators about its identity verification solution — including the Information Commissioner's Office, which oversees data protection in the U.K.

"We've been having lots of conversations with regulators," Ludwig told CNBC. "In general, there's been lots of questions: how do we make sure this works? How do we protect privacy? If we engage with this, does it expose us to risks?"

"All of those questions we've been able to answer," he added. "It's been a while since we've had a question asked we didn't have an answer to."