AI firms say they can't respect copyright. These researchers tried.

Washington Post, June 5, 2025
Happy Thursday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI to: nitasha.tiku@washpost.com.
AI firms say they can't respect copyright. These researchers tried.
As the policy debate over AI and fair use heats up, a new paper suggests there's a more transparent — if time-consuming — alternative to slurping up web content without permission.
Top artificial intelligence companies argue that it's impossible to build today's powerful large language models — the GPT in ChatGPT — unless they can freely scrape copyrighted materials from the internet to train their AI systems.
But few AI developers have tried the more ethical route — until now.
A group of more than two dozen AI researchers found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset's quality by using it to train a 7-billion-parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023.
A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate.
The group built an AI model that is significantly smaller than the latest models behind OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and most rigorous effort yet to demonstrate a different way of building popular AI tools.
That could have implications for the policy debate swirling around AI and copyright.
The paper itself does not take a position on whether scraping text to train AI is fair use.
That debate has reignited in recent weeks with a high-profile lawsuit and dramatic turns around copyright law and enforcement in both the U.S. and U.K.
On Wednesday, Reddit said it was suing Anthropic, alleging that it accessed data from the social media discussion board without a licensing agreement, according to The Wall Street Journal. The same day, the U.K.'s House of Commons offered concessions on a controversial bill that would allow AI companies to train on copyrighted material.
These moves follow President Donald Trump's firing last month of the head of the U.S. Copyright Office, Shira Perlmutter. Her ouster brought more attention to the office's recent report on AI, which cast doubt on whether training generative AI on copyrighted works qualifies as fair use.
AI companies and their investors, meanwhile, have long argued that a better way is not feasible.
In April 2023, Sy Damle, a lawyer representing the venture capital firm Andreessen Horowitz, told the U.S. Copyright Office: 'The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.' Later that year, in comments to the U.K. government, OpenAI said, '[I]t would be impossible to train today's leading AI models without using copyrighted materials.'
And in January 2024, Anthropic's expert witness in a copyright trial asserted that 'the hypothetical competitive market for licenses covering data to train cutting-edge LLMs would be impracticable,' court documents show.
While AI policy papers often discuss the need for more open data and experts argue about whether large language models should be trained on licensed data from publishers, there's little effort to put theory into action, the paper's co-author, Aviya Skowron, head of policy at the nonprofit research institute Eleuther AI, told The Post.
'I would also like those people to get curious about what this task actually entails,' Skowron said.
As it turns out, the task involves a lot of humans.
That's because of the technical challenge of data that isn't formatted in a machine-readable way, as well as the legal challenge of figuring out which license applies to which website, a daunting prospect when the industry is rife with improperly licensed data.
'This isn't a thing where you can just scale up the resources that you have available' like access to more computer chips and a fancy web scraper, said Stella Biderman, Eleuther AI's executive director. 'We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard.'
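As a rough illustration only, here is a minimal sketch in Python of the kind of triage the researchers describe: automated filtering against an allowlist of open licenses, with ambiguous cases routed to a human reviewer. The license names, the Document fields and the triage logic are hypothetical stand-ins, not the Common Pile's actual tooling.

```python
# A minimal, hypothetical sketch of license triage: automated
# filtering against an allowlist of open licenses, with anything
# ambiguous routed to a human reviewer rather than silently kept.
from dataclasses import dataclass, field

# Licenses that clearly permit reuse. A real effort would also track
# license versions, attribution requirements and share-alike terms.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT", "public-domain"}

@dataclass
class Document:
    url: str
    text: str
    declared_license: str | None = None  # e.g. scraped from HTML metadata
    notes: list[str] = field(default_factory=list)

def triage(doc: Document) -> str:
    """Return 'include', 'exclude' or 'review' for a single document."""
    if doc.declared_license is None:
        doc.notes.append("no machine-readable license found")
        return "review"  # a human must check the site's terms
    if doc.declared_license in OPEN_LICENSES:
        return "include"
    # A declared but non-open license. Because the web is rife with
    # improperly licensed data, even these can deserve a second look.
    doc.notes.append(f"non-open license: {doc.declared_license}")
    return "exclude"

if __name__ == "__main__":
    docs = [
        Document("https://example.org/a", "...", "CC-BY-4.0"),
        Document("https://example.org/b", "...", None),
        Document("https://example.org/c", "...", "All rights reserved"),
    ]
    for d in docs:
        print(d.url, "->", triage(d))
```

Even a toy version like this makes the bottleneck visible: every document without clean, machine-readable license metadata lands in the 'review' bucket, and that bucket is where the humans Biderman describes spend their time.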
Still, the group managed to unearth new datasets that can be used ethically.
Those include a set of 130,000 English-language books in the Library of Congress, a collection nearly double the size of Project Gutenberg, the popular dataset of public-domain books.
The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as FineWeb from Hugging Face, the open-source repository for machine learning.
Eleuther AI pioneered an analogous open-source effort in 2020, creating an often-cited dataset called the Pile. A site that hosted the dataset had to take it down in 2023 after a Digital Millennium Copyright Act takedown notice from the Danish anti-piracy group Rights Alliance, which objected to the Pile's inclusion of Books3, a dataset of books over which Meta is being sued.
The new dataset is called Common Pile v0.1, and the model is called Comma v0.1; the version numbers are a deliberate nod to the group's belief that it can find more openly licensed and public-domain text that can then be used to train bigger models.
Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models.
The paper's authors represent 14 institutions, including MIT, Carnegie Mellon University and the University of Toronto, as well as nonprofits such as the Vector Institute and the Allen Institute for Artificial Intelligence.
Biderman said she didn't expect companies such as OpenAI and Anthropic to adopt the same laborious process, but she hoped it would encourage them at least to return to the practices of 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on.
'Even partial transparency has a huge amount of social value and a moderate amount of scientific value,' she said.
Musk rails against Trump tax bill, calling it a 'disgusting abomination' (Jacob Bogage and Theodoric Meyer)
Federal judge blocks Florida from enforcing social media ban for kids while lawsuit continues (Associated Press)
Apple and Alibaba's AI rollout in China delayed by Trump trade war (Financial Times)
Trump renegotiating Biden-era Chips Act grants, Lutnick says (Reuters)
US removes 'safety' from AI Safety Institute (The Verge)
5 AI bots took our tough reading test. One was smartest — and it wasn't ChatGPT (Geoffrey A. Fowler)
You are hardwired to blindly trust AI. Here's how to fight it. (Shira Ovide)
Reddit sues Anthropic, alleges unauthorized use of site's data (Wall Street Journal)
Amazon to invest $10 billion in North Carolina to expand cloud, AI infrastructure (Reuters)
Germans are buying more electric cars, but not Teslas (New York Times)
Google warns hackers stealing Salesforce data from companies (Bloomberg)
Chinese hacked US Telecom a year before known wireless breaches (Bloomberg)
ChatGPT can now read your Google Drive and Dropbox (The Verge)
Google DeepMind's CEO thinks AI will make humans less selfish (Wired)
The creatives and academics rejecting AI — at work and at home (The Guardian)
That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!