Here's the list of websites gig workers used to fine-tune Anthropic's AI models. Its contractor left it wide open.

An internal spreadsheet obtained by Business Insider shows which websites Surge AI gig workers were told to mine — and which to avoid — while fine-tuning Anthropic's AI to make it sound more "helpful, honest, and harmless."
The spreadsheet allows sources like Bloomberg, Harvard University, and the New England Journal of Medicine while blacklisting others like The New York Times and Reddit.
Anthropic said it wasn't aware of the spreadsheet and that it was created by a third-party vendor, the data-labeling startup Surge AI, which declined to comment on this point.
"This document was created by a third-party vendor without our involvement," an Anthropic spokesperson said. "We were unaware of its existence until today and cannot validate the contents of the specific document since we had no role in its creation."
Frontier AI companies mine the internet for content and often work with startups with thousands of human contractors, like Surge, to refine their AI models.
In this case, project documents show Surge worked to make Anthropic's AI sound more human, avoid "offensive" statements, and cite documents more accurately.
Many of the whitelisted sources copyright or otherwise restrict their content. The Mayo Clinic, Cornell University, and Morningstar, whose main websites were all listed as "sites you can use," told BI they have no agreements with Anthropic to use their content for training AI models.
Surge left a trove of materials detailing its work for Anthropic, including the spreadsheet, accessible to anyone with the link on Google Drive. Surge locked down the documents shortly after BI reached out for comment.
"We take data security seriously, and documents are restricted by project and access level where possible," a Surge spokesperson said. "We are looking closely into the matter to ensure all materials are protected."
It's the latest incident in which a data-labeling startup used public Google Docs to pass around sensitive AI training instructions. Surge's competitor, Scale AI, also exposed internal data in this manner, locking the documents down after BI revealed the issue.
A Google Cloud spokesperson told BI that its default setting prevents a company's files from being shared outside the organization; changing that setting is a "choice that a customer explicitly makes," the spokesperson said.
Surge hit $1 billion in revenue last year and is raising funds at a $15 billion valuation, Reuters reported. Anthropic was most recently valued at $61.5 billion, and its Claude chatbot is widely considered a leading competitor to ChatGPT.
What's allowed — and what's not
Google Sheet data showed the spreadsheet was created in November 2024, and it's referenced in updates as recent as May 2025 in other documents left public by Surge.
The list functions as a "guide" for what online sources Surge's gig workers can and can't use on the Anthropic project.
The list includes more than 120 permitted websites spanning academia, healthcare, law, and finance, among them 10 US universities, including Harvard, Yale, Northwestern, and the University of Chicago.
It also lists popular business news sources, such as Bloomberg, PitchBook, Crunchbase, Seeking Alpha, Investing.com, and PR Newswire.
Medical information sources, such as the New England Journal of Medicine, and government sources, such as a list of UN treaties and the US National Archives, are also on the whitelist, as are university publishers like Cambridge University Press.
Here's the full list of permitted sources, which the spreadsheet says is "not exhaustive." And here's the blacklist: more than 50 "common sources" that are "now disallowed," as the spreadsheet puts it.
The blacklist mostly consists of media outlets like The New York Times, The Wall Street Journal, and others. It also includes other types of sources like Reddit, Stanford University, the academic publisher Wiley, and the Harvard Business Review.
The spreadsheet doesn't explain why some sources are permitted and others are not.
The blacklist could reflect websites that made direct demands to AI companies to stop using their content, said Edward Lee, a law professor at Santa Clara University. That can happen through written requests or through an automated method like robots.txt.
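For illustration, a site can signal that request automatically through a robots.txt file at its root. The snippet below uses "ClaudeBot," the user agent Anthropic has publicly documented for its crawler; whether any particular blacklisted site uses exactly this rule is an assumption.

```
# robots.txt at a site's root, asking Anthropic's crawler not to access any pages
User-agent: ClaudeBot
Disallow: /
```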
Some sources in the blacklist have taken legal stances against AI companies using their content. Reddit, for example, sued Anthropic this year, saying the AI company accessed its site without permission. Anthropic has denied these claims. The New York Times sued OpenAI, and The Wall Street Journal's parent, Dow Jones, sued Perplexity, for similar reasons.
"The Times has objected to Anthropic's unlicensed use of Times content for AI purposes and has taken steps to block their access as part of our ongoing IP protection and enforcement efforts," the Times spokesperson Charlie Stadtlander told BI.
"As the law and our terms of service make clear, scraping or using the Times's content is prohibited without our prior written permission, such as a licensing agreement."
Surge workers used the list for RLHF
Surge contractors were told to use the list for a later, but crucial, stage of AI model training in which humans rate an existing chatbot's responses to improve them. That process is called "reinforcement learning from human feedback," or RLHF.
The Surge contractors working for Anthropic did tasks like copying and pasting text from the internet, asking the AI to summarize it, and choosing the best summary. In another case, workers were asked to "find at least 5-10 PDFs" from the web and quiz Anthropic's AI about the documents' content to improve its citation skills.
That doesn't involve feeding web data directly into the model for it to regurgitate later, the better-known process called pre-training.
Courts haven't addressed whether there's a clear distinction between the two processes when it comes to copyright law. There's a good chance both would be viewed as crucial to building a state-of-the-art AI model, Lee, the law professor, said.
It is "probably not going to make a material difference in terms of fair use," Lee said.
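The rating workflow described above boils down to preference records: for a given prompt, which of two model outputs a human judged better. The sketch below shows that basic unit; the field names and helper function are illustrative, not Surge's or Anthropic's actual schema.

```python
# Minimal sketch of an RLHF preference record: one human judgment
# comparing two model summaries of the same source text.
# (Illustrative schema; not any company's real training format.)

def build_comparison(source_text, summary_a, summary_b, rater_choice):
    """Package one rating: which of two summaries the human preferred."""
    assert rater_choice in ("a", "b")
    return {
        "prompt": f"Summarize the following:\n{source_text}",
        "chosen": summary_a if rater_choice == "a" else summary_b,
        "rejected": summary_b if rater_choice == "a" else summary_a,
    }

record = build_comparison(
    "Example article text pasted from a whitelisted site.",
    "A concise, accurate summary.",
    "A rambling summary that misses the point.",
    rater_choice="a",
)
print(record["chosen"])  # prints "A concise, accurate summary."
```

A reward model trained on many such chosen/rejected pairs then steers the chatbot toward the kinds of answers raters preferred.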