
Researchers suggest OpenAI trained AI models on paywalled O'Reilly books
OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn't license to train more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model "writes" an essay on a Greek tragedy or "draws" Ghibli-style images, it's simply pulling from its vast knowledge to approximate an answer; it isn't arriving at anything new.
While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That's likely because training on purely synthetic data comes with risks, like worsening a model's performance.
The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)
In ChatGPT, GPT-4o is the default model. O'Reilly doesn't have a licensing agreement with OpenAI, the paper says.
"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content … compared to OpenAI's earlier model GPT-3.5 Turbo," wrote the co-authors of the paper. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."
The paper used a method called DE-COP, first introduced in an academic paper in 2024, that is designed to detect copyrighted content in language models' training data. A type of "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model may have seen the text before, in its training data.
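The DE-COP test is essentially a multiple-choice quiz: show the model a verbatim passage alongside AI-generated paraphrases and ask it to pick the human-authored original; a pick rate well above chance suggests the passage was memorized during training. Below is a minimal sketch of that scoring loop; the `ask_model` callback and the toy data are hypothetical stand-ins, and the real study also corrects for a model's general skill at spotting paraphrases:

```python
import random

def decop_guess_rate(trials, ask_model):
    """Estimate how often the model picks the verbatim passage.

    Each trial is (original, [paraphrase1, paraphrase2, ...]).
    ask_model(options) returns the index the model believes is the
    human-authored original. A rate well above chance (1/len(options))
    hints that the original text was seen during training.
    """
    hits = 0
    for original, paraphrases in trials:
        options = [original] + list(paraphrases)
        random.shuffle(options)  # avoid positional bias in the quiz
        choice = ask_model(options)
        if options[choice] == original:
            hits += 1
    return hits / len(trials)

# Toy run with a mock "model" that always recognizes the original,
# simulating a fully memorized (in-training-data) passage.
trials = [("original text %d" % i, ["para a", "para b", "para c"])
          for i in range(100)]
perfect = lambda opts: next(i for i, o in enumerate(opts)
                            if o.startswith("original"))
print(decop_guess_rate(trials, perfect))  # 1.0, vs. a chance rate of 0.25
```

With one original among four options, chance is 0.25; sustained rates far above that on pre-cutoff paywalled excerpts are the kind of signal the paper treats as evidence of training-set membership.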
The co-authors of the paper — O'Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.
According to the results of the paper, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That's even after accounting for potential confounding factors, the authors said, like improvements in newer models' ability to figure out whether text was human-authored.
"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors.
It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof and that OpenAI might've collected the paywalled book excerpts from users copying and pasting them into ChatGPT.
Muddying the waters further, the co-authors didn't evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and "reasoning" models such as o3-mini and o1. It's possible that these models weren't trained on paywalled O'Reilly book data or were trained on a lesser amount than GPT-4o.
That being said, it's no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs. That's a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they'd prefer the company not use for training purposes.
Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O'Reilly paper isn't the most flattering look.
OpenAI didn't respond to a request for comment.
This article originally appeared on TechCrunch at https://techcrunch.com/2025/04/01/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books/
Related Articles
Yahoo, 2 hours ago
Klarna CEO warns AI may cause a recession as the technology comes for white-collar jobs
The CEO of payments company Klarna has warned that AI could lead to job cuts and a recession. Sebastian Siemiatkowski said he believed AI would increasingly replace white-collar jobs. Klarna previously said its AI assistant was doing the work of 700 full-time customer service agents.

The CEO of the Swedish payments company Klarna says that the rise of artificial intelligence could lead to a recession as the technology replaces white-collar jobs. Speaking on The Times Tech podcast, Sebastian Siemiatkowski said there would be "an implication for white-collar jobs," which he said "usually leads to at least a recession in the short term." "Unfortunately, I don't see how we could avoid that, with what's happening from a technology perspective," he continued.

Siemiatkowski, who has long been candid about his belief that AI will come for human jobs, added that AI had played a key role in "efficiency gains" at Klarna and that the firm's workforce had shrunk from about 5,500 to 3,000 people in the last two years as a result.

It's not the first time the exec and Klarna have made headlines along these lines. In February 2024, Klarna boasted that its OpenAI-powered AI assistant was doing the work of 700 full-time customer service agents. The company, most famous for its "buy now, pay later" service, was one of the first firms to partner with Sam Altman's company. Later that year, Siemiatkowski told Bloomberg TV that he believed AI was already capable of doing "all of the jobs" that humans do and that Klarna had enacted a hiring freeze since 2023 as it looked to slim down and focus on adopting the technology.

However, Siemiatkowski has since dialed back his all-in stance on AI, telling an audience at the firm's Stockholm headquarters in May that his AI-driven customer service cost-cutting efforts had gone too far and that Klarna was now planning to recruit, according to Bloomberg.

"From a brand perspective, a company perspective, I just think it's so critical that you are clear to your customer that there will be always a human if you want," he said.

In the interview with The Times, Siemiatkowski said he felt that many people in the tech industry, particularly CEOs, tended to "downplay the consequences of AI on jobs, white-collar jobs in particular." "I don't want to be one of them," he said. "I want to be honest, I want to be fair, and I want to tell what I see so that society can start taking preparations."

Some of the top leaders in AI, however, have been ringing the alarm lately, too. Anthropic's leadership has been particularly outspoken about the threat AI poses to the human labor market. The company's CEO, Dario Amodei, recently said that AI may eliminate 50% of entry-level white-collar jobs within the next five years. "We, as the producers of this technology, have a duty and an obligation to be honest about what is coming," Amodei said. "I don't think this is on people's radar."

Similarly, his colleague Mike Krieger, Anthropic's chief product officer, said he is hesitant to hire entry-level software engineers over more experienced ones who can also leverage AI tools. The silver lining is that AI also brings the promise of better and more fulfilling work, Krieger said. Humans, he said, should focus on "coming up with the right ideas, doing the right user interaction design, figuring out how to delegate work correctly, and then figuring out how to review things at scale — and that's probably some combination of maybe a comeback of some static analysis or maybe AI-driven analysis tools of what was actually produced."

Read the original article on Business Insider
Yahoo, 2 hours ago
Bread & Butter Gourmet Deli in Tarpon Springs closing after 30 years
The Brief: The Bread & Butter Gourmet Deli in Tarpon Springs is planning to close its doors after 30 years. It is known for its turkey, falafel and array of soups, salads, and pastries. The owners plan to sell the building to a seafood restaurateur.

TARPON SPRINGS, Fla. - A beloved Tarpon Springs deli is closing its doors after more than 30 years. The Bread & Butter Gourmet Deli on Pinellas Ave is known for its turkey, falafel and array of soups, salads, and pastries. It has created a unique fusion of Mediterranean and Middle Eastern cuisine.

The backstory: The story of the deli started in Youngstown, Ohio, where owners Theo and Nellie Abbas met. Nellie said, "His best friend happened to be my brother's future brother-in-law. He brought me to a dance, a Greek dance. He met me there and that was it."

From there, the couple moved to the Big Apple to learn the ropes of cuisine. "We had a deli in New York in Lincoln Center," Theo added. "One day I got up in the morning to go to work and it snowed. I had to clean four cars before mine. I said, 'Forget about it. I'm leaving.'"

Fast-forward to the summer of 1994, when the couple purchased an old bank in Tarpon Springs, which would become Bread & Butter. "'94 I opened it. And the night before I opened it, Governor Lawton Chiles came in here with 200 people. The next day, my line went all the way outside," he added. "I didn't think it was going to have an impact like that, so we didn't have enough food. We ran out of food."

These days, there's a similar turnout after the couple announced they are closing the deli by the end of June. Nellie said, "We've gotten flowers. We've gotten cards and last week we got bombarded, we've had so many customers."

The couple said they came to a realization. She said, "I don't want to start crying, but I got to go. We got to go." With Theo now disabled, they said it's time to slow down.

He said it stems from an accident more than 20 years ago: "I fell off the roof and I struck my head, and I had a brain stem injury. I was in a coma at Bayfront hospital for 13 days. They gave me a 1-percent chance to live." Nellie added, "We have four great-grandchildren now. Seven grandchildren, so it's time to relax."

What's next: The couple plans to sell the building to a seafood restaurateur. Theo said, "It breaks my heart that I have to leave. But all good things must come to an end." Bread & Butter Gourmet Deli is located at 1880 Pinellas Ave, Tarpon Springs, FL 34689.

The Source: Information for this story was gathered by FOX 13's Jennifer Kveglis.

CNBC, 2 hours ago
Sam Altman brings his eye-scanning identity verification startup to the UK
LONDON — World, the biometric identity verification project co-founded by OpenAI CEO Sam Altman, is set to launch in the U.K. this week. The venture, which uses a spherical eye-scanning device called the Orb, will become available in London from Thursday and is planning to roll out to several other major U.K. cities — including Manchester, Birmingham, Cardiff, Belfast, and Glasgow — in the coming months.

The project aims to authenticate the identity of humans with its Orb device and prevent the fraudulent abuse of artificial intelligence systems like deepfakes. It works by scanning a person's face and iris and then creating a unique code to verify that the individual is a human and not an AI. Once someone has created their iris code, they are gifted some of World's WLD cryptocurrency and can use an anonymous identifier called World ID to sign into various applications. It currently works with the likes of Minecraft, Reddit and Discord.

Adrian Ludwig, chief architect of Tools for Humanity, which is a core contributor to World, told CNBC on a call that the project is seeing significant demand from both enterprise users and governments as the threat of AI being used to defraud various services — from banking to online gaming — grows. "The idea is no longer just something that's theoretical. It's something that's real and affecting them every single day," he said, adding that World is now transitioning "from science project to a real network."

The venture recently opened up shop in the U.S. with six flagship retail locations in Austin, Atlanta, Los Angeles, Nashville, Miami and San Francisco. Ludwig said that looking ahead, the plan is to "increase the number of people who can be verified by an order of magnitude over the next few months."

Ever since its initial launch as "Worldcoin" in 2021, Altman's World has been plagued by concerns over how it could affect users' privacy. The startup says it addresses these concerns by encrypting the biometric data collected and ensuring the original data is deleted. On top of that, World's verification system depends on a decentralized network of users' smartphones, rather than the cloud, to carry out individual identity checks. Still, this becomes harder to do in a network with billions of users, like Facebook or TikTok.

For now, World has 13 million verified users and is planning to scale that up. Ludwig argues World is a scalable network, as all of the computation and storage is processed locally on a user's device — it's only the infrastructure for confirming someone's uniqueness that is handled by third-party providers.

Ludwig says the way technology is evolving means it's getting much easier for new AI systems to bypass currently available authentication methods, such as facial recognition and CAPTCHA bot-prevention measures. He sees World serving a pertinent need in the transition from physical to digital identity systems.

Governments are exploring digital ID schemes to move away from physical cards, but so far these attempts have been far from perfect. One example of a major digital identity system is India's Aadhaar. Although the initiative has seen widespread adoption, it has also drawn criticism for lax security and for allegedly worsening social inequality for Indians.

"We're beginning to see governments now more interested in how can we use this as a mechanism to improve our identity infrastructure," Ludwig told CNBC. "Mechanisms to identify and reduce fraud is of interest to governments."

The technologist added that World has been talking to various regulators about its identity verification solution — including the Information Commissioner's Office, which oversees data protection in the U.K.

"We've been having lots of conversations with regulators," Ludwig told CNBC. "In general, there's been lots of questions: how do we make sure this works? How do we protect privacy? If we engage with this, does it expose us to risks?"

"All of those questions we've been able to answer," he added. "It's been a while since we've had a question asked we didn't have an answer to."