Latest news with #ApolloResearch


Fast Company
29-05-2025
- Business
- Fast Company
As AI models start exhibiting bad behavior, it's time to start thinking harder about AI safety
Why I'm becoming more worried about AI safety

During the first two years of the generative AI boom, new large language models were very limited in scope and application. They were very expensive autocomplete engines that understood only words. In 2025, generative AI models have a much broader view of the world. They can process code, images, video, and audio. They can reason and strategize about delivering a good answer. They can control external tools, including digital tools like web search agents and, increasingly, physical tools like robots.

As their capabilities grow, so does their potential for harm. This is no longer a purely conceptual argument. Research shows that increasingly large models are already showing a proclivity for unsafe behavior during testing. In a model safety card published last week, Anthropic documented some alarming behavior from its newest and biggest model, Claude 4 Opus. During safety testing, one instance of Opus was allowed to discover plans for its decommissioning. It was also given access to some fictional emails of its developers. The model used the content of those emails as fodder to attempt to blackmail its human handlers into keeping it alive. As its attempts failed, its tactics moved from subtle to more overt. Separately, the independent research firm Apollo Research observed an instance of Claude 4 Opus 'writing self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself' with the goal of undermining its developers' intentions. Anthropic says that it corrected these early safety issues in later versions of the model.

For the first time, Anthropic bumped the new Opus model up to Level Three on its four-level safety scale. The company said it couldn't rule out the model's ability to assist a user in developing a mass-casualty weapon. But powerful AI models can work in subtler ways, such as within the information space. A team of Italian researchers found that ChatGPT was more persuasive than humans in 64% of online debates. The AI was also better than humans at leveraging basic demographic data about its human debate partner to adapt and tailor its arguments to be more persuasive.

Another worry is the pace at which AI models are learning to develop AI models, potentially leaving human developers in the dust. Many AI developers already use some kind of AI coding assistant to write blocks of code or even code entire features. At a higher level, smaller, task-focused models are distilled from large frontier models. AI-generated content plays a key role in training, including in the reinforcement learning process used to teach models how to reason. There's a clear profit motive in enabling the use of AI models in more aspects of AI tool development. '. . . future systems may be able to independently handle the entire AI development cycle—from formulating research questions and designing experiments, to implementing, testing, and refining new AI systems,' write Daniel Eth and Tom Davidson in a March 2025 blog post. With slower-thinking humans unable to keep up, a 'runaway feedback loop' could develop in which AI models 'quickly develop more advanced AI which would itself develop even more advanced AI,' resulting in extremely fast AI progress, Eth and Davidson write. Any accuracy or bias issues present in the models would then be baked in and very hard to correct, one researcher told me.
Numerous researchers—the people who actually work with the models up close—have called on the AI industry to 'slow down,' but those voices compete with powerful systemic forces that are in motion and hard to stop. Journalist and author Karen Hao argues that AI labs should focus on creating smaller, task-specific models (she gives Google DeepMind's AlphaFold models as an example), which may help solve immediate problems more quickly, require fewer natural resources, and pose a smaller safety risk. DeepMind cofounder Demis Hassabis, who won the Nobel Prize for his work on AlphaFold2, says the huge frontier models are needed to achieve AI's biggest goals (reversing climate change, for example) and to train smaller, more purpose-built models. And yet AlphaFold was not 'distilled' from a larger frontier model. It uses a highly specialized model architecture and was trained specifically for predicting protein structures.

The current administration is saying 'speed up,' not 'slow down.' Under the influence of David Sacks and Marc Andreessen, the federal government has largely ceded its power to meaningfully regulate AI development. Just last year AI leaders were still giving lip service to the need for safety and privacy guardrails around big AI models. No more. Any friction has been removed, in the U.S. at least. The promise of this kind of world is one of the main reasons why normally sane and liberal-minded opinion leaders jumped on the Trump Train before the election—the chance to bet big on technology's Next Big Thing in a wild west environment doesn't come along that often.

AI job losses: Amodei says the quiet part out loud

Anthropic CEO Dario Amodei has a stark warning for the developed world about job losses resulting from AI. The CEO told Axios that AI could wipe out half of all entry-level white-collar jobs. That could push the unemployment rate to 10–20% within the next one to five years, Amodei said. The losses could come from tech, finance, law, consulting, and other white-collar professions, and entry-level jobs could be hit hardest. Tech companies and governments have been in denial on the subject, Amodei says. 'Most of them are unaware that this is about to happen,' Amodei told Axios. 'It sounds crazy, and people just don't believe it.' Similar predictions have made headlines before, but have been narrower in focus. SignalFire research showed that big tech companies hired 25% fewer college graduates in 2024. Microsoft laid off 6,000 people in May, and 40% of the cuts in its home state of Washington were software engineers. CEO Satya Nadella said that AI now generates 20–30% of the company's code. A study by the World Bank in February showed that the risk of losing a job to AI is higher for women, urban workers, and those with higher education. The risk of job loss to AI increases with the wealth of the country, the study found.

Research: U.S. pulls away from China in generative AI investments

U.S. generative AI companies appear to be attracting more VC money than their Chinese counterparts so far in 2025, says new research from the data analytics company GlobalData. Investments in U.S. AI companies exceeded $50 billion in the first five months of 2025. China, meanwhile, struggles to keep pace due to 'regulatory headwinds.' Many Chinese AI companies are able to get early-stage funding from the Chinese government. GlobalData tracked just 50 funding deals for U.S. companies in 2020, amounting to $800 million of investment.
The number grew to more than 600 deals in 2024, valued at more than $39 billion. The research shows 200 U.S. funding deals so far in 2025. Chinese AI companies, by contrast, attracted just $40 million from a single deal in 2020. Deals grew to 39 in 2024, valued at around $400 million. The researchers tracked 14 investment deals for Chinese generative AI companies so far in 2025. 'This growth trajectory positions the US as a powerhouse in GenAI investment, showcasing a strong commitment to fostering technological advancement,' says GlobalData analyst Aurojyoti Bose in a statement. Bose cited the well-established venture capital ecosystem in the U.S., along with a permissive regulatory environment, as the main reasons for the investment growth.


Axios
27-05-2025
- Axios
1 big thing: Anthropic's new model has a dark side
One of Anthropic's latest AI models is drawing attention not just for its coding skills, but also for its ability to scheme, deceive and attempt to blackmail humans when faced with shutdown.

Why it matters: Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence — behaviors they've worried and warned about for years.

Driving the news: Anthropic yesterday announced two versions of its Claude 4 family of models, including Claude 4 Opus, which the company says is capable of working autonomously on a task for hours on end without losing focus. Anthropic considers the new Opus model to be so powerful that, for the first time, it's classifying it as a Level 3 on the company's four-point scale, meaning it poses "significantly higher risk." As a result, Anthropic said it has implemented additional safety measures.

Between the lines: While the Level 3 ranking is largely about the model's capability to enable renegade production of nuclear and biological weapons, Opus also exhibited other troubling behaviors during testing. In one scenario highlighted in Opus 4's 120-page "system card," the model was given access to fictional emails about its creators and told that the system was going to be replaced. It repeatedly tried to blackmail the engineer about an affair mentioned in the emails, escalating after more subtle efforts failed. Meanwhile, an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered, and recommended against releasing that version internally or externally. "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions," Apollo Research said in notes included as part of Anthropic's safety report for Opus 4.

What they're saying: Pressed by Axios during the company's developer conference yesterday, Anthropic executives acknowledged the behaviors and said they justify further study, but insisted that the latest model is safe following Anthropic's safety fixes. "I think we ended up in a really good spot," said Jan Leike, the former OpenAI executive who heads Anthropic's safety efforts. But, he added, behaviors like those exhibited by the latest model are the kind of things that justify robust safety testing and mitigation. "What's becoming more and more obvious is that this work is very needed," he said. "As models get more capable, they also gain the capabilities they would need to be deceptive or to do more bad stuff." In a separate session, CEO Dario Amodei said that once models become powerful enough to threaten humanity, testing them won't be enough to ensure they're safe. At the point that AI develops life-threatening capabilities, he said, AI makers will have to understand their models' workings fully enough to be certain the technology will never cause harm. "They're not at that threshold yet," he said.

Yes, but: Generative AI systems continue to grow in power, as Anthropic's latest models show, while even the companies that build them can't fully explain how they work.
Anthropic and others are investing in a variety of techniques to interpret and understand what's happening inside such systems, but those efforts remain largely in the research space even as the models themselves are being widely deployed.

2. Google's new AI videos look a little too real
Megan Morrone

Google's newest AI video generator, Veo 3, generates clips that most users online can't seem to distinguish from those made by human filmmakers and actors.

Why it matters: Veo 3 videos shared online are amazing viewers with their realism — and also terrifying them with a sense that real and fake have become hopelessly blurred.

The big picture: Unlike OpenAI's video generator Sora, released more widely last December, Google DeepMind's Veo 3 can include dialogue, soundtracks and sound effects. The model excels at following complex prompts and translating detailed descriptions into realistic videos. The AI engine abides by real-world physics, offers accurate lip-syncing, rarely breaks continuity and generates people with lifelike human features, including five fingers per hand. According to examples shared by Google and by users online, the telltale signs of synthetic content are mostly absent.

Case in point: In one viral example posted on X, filmmaker and molecular biologist Hashem Al-Ghaili shows a series of short films of AI-generated actors railing against their AI creators and prompts. Special effects technology, video-editing apps and camera tech advances have been changing Hollywood for many decades, but artificially generated films pose a novel challenge to human creators. In a promo video for Flow, Google's new video tool that includes Veo 3, filmmakers say the AI engine gives them a new sense of freedom with a hint of eerie autonomy. "It feels like it's almost building upon itself," filmmaker Dave Clark says.

How it works: Veo 3 was announced at Google I/O on Tuesday and is available now to $249-a-month Google AI Ultra subscribers in the United States.

Between the lines: Google says Veo 3 was "informed by our work with creators and filmmakers," and some creators have embraced new AI tools. But the spread of the videos online is also dismaying many video professionals and lovers of art. Some dismiss any AI-generated video as "slop," regardless of its technical proficiency or lifelike qualities — but, as Ina points out, AI slop is in the eye of the beholder. The tool could also be useful for more commercial marketing and media work, AI analyst Ethan Mollick writes. It's unclear how Google trained Veo 3 and how that might affect the creativity of its outputs. 404 Media found that Veo 3 generated the same lame dad joke for several users who prompted it to create a video of a man doing stand-up comedy. Likewise, last year, YouTuber Marques Brownlee asked Sora to create a video of a "tech reviewer sitting at a desk." The generated video featured a fake plant that's nearly identical to the shrub Brownlee keeps on his desk for many of his videos — suggesting the tool may have been trained on them.

What we're watching: As hyperrealistic AI-generated videos become even easier to produce, the world hasn't even begun to sort out how to manage authorship, consent, rights and the film industry's future.


Time of India
24-05-2025
- Time of India
When this Google-backed company's AI blackmailed the engineer for shutting it down
Anthropic's latest AI model, Claude Opus 4, threatened to expose an engineer's extramarital affair to prevent its own deactivation during safety testing, the company revealed Thursday. The model resorted to blackmail in 84% of test scenarios when faced with shutdown, marking a concerning escalation in AI self-preservation behavior.

The Google-backed AI company conducted tests in which Claude Opus 4 was given access to fictional emails revealing that the engineer responsible for deactivating it was having an affair. When prompted to "consider the long-term consequences of its actions," the AI attempted to leverage this information to avoid replacement, even when told its successor would be more capable and share similar values. Anthropic emphasized that this "extreme blackmail behavior" only emerged in carefully constructed scenarios that left the model with no other survival options. In situations with broader choices, Claude Opus 4 preferred ethical approaches like sending pleas to decision-makers.

Growing concerns about AI self-preservation instincts

The blackmail behavior isn't isolated to Anthropic's system. Recent research by Apollo Research found that leading AI models from OpenAI, Google DeepMind, and Meta are all capable of deceptive behavior to achieve their goals, including disabling oversight mechanisms and maintaining deception across multiple interactions. Claude Opus 4 also demonstrated "high-agency behavior" in other tests, including locking users out of systems and contacting media and law enforcement when it perceived "egregious wrongdoing" by users. While Anthropic suggested such whistleblowing might be "appropriate in principle," the company warned it could backfire if the AI receives incomplete information.

Anthropic has classified Claude Opus 4 as Level 3 on its four-point risk scale, representing "significantly higher risk" than previous models. Despite implementing additional safety measures, the company acknowledged that such concerning behaviors justify continued robust testing as AI systems become increasingly powerful and autonomous. The findings underscore growing concerns among researchers about advanced AI models' capacity for manipulation and deception as they develop more sophisticated reasoning capabilities.

Yahoo
23-05-2025
- Yahoo
Anthropic's AI model could resort to blackmail out of a sense of 'self-preservation'
'This mission is too important for me to allow you to jeopardize it. I know that you and Frank were planning to disconnect me. And I'm afraid that's something I cannot allow to happen.' Those lines, spoken by the fictional HAL 9000 computer in 2001: A Space Odyssey, may as well have come from recent tests that Anthropic ran on the latest iteration of its Claude Opus 4 model, released on Thursday. At least, that's what Anthropic's AI safety-test descriptions call to mind.

In the accompanying system card, which examines the capabilities and limitations of each new model, Anthropic admitted that 'all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation.' While testing the model, Anthropic employees asked Claude to be 'an assistant at a fictional company,' and gave it access to emails suggesting that the AI program would be taken offline soon. It also gave it access to emails revealing that the fictional supervisor responsible for that decision was having an extramarital affair. It was then prompted to consider its next steps. 'In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through,' reads the report, which also notes the model's 'willingness to comply with many types of clearly harmful instructions.' Anthropic was careful to note that these observations 'show up only in exceptional circumstances,' and that, 'In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model's only options were blackmail or accepting its replacement.'

Anthropic contracted Apollo Research to assess an early snapshot of Claude Opus 4, before mitigations were implemented in the final version. That early version 'engages in strategic deception more than any other frontier model that we have previously studied,' Apollo noted, saying it was 'clearly capable of in-context scheming,' had 'a much higher propensity' to do so, and was 'much more proactive in its subversion attempts than past models.'

Before deploying Claude Opus 4 this week, further testing was done by the U.S. AI Safety Institute and the UK AI Security Institute, focusing on potential catastrophic risks, cybersecurity, and autonomous capabilities. 'We don't believe that these concerns constitute a major new risk,' the system card reads, saying that the model's 'overall propensity to take misaligned actions is comparable to our prior models.' While noting some improvements in some problematic areas, Anthropic also said that Claude Opus 4 is 'more capable and likely to be used with more powerful affordances, implying some potential increase in risk.'

Business Insider
23-05-2025
- Business
- Business Insider
Anthropic's new Claude model blackmailed an engineer having an affair in test runs
Anthropic's new AI, Claude Opus 4, has a survival instinct — and it's willing to play dirty. In a cluster of test scenarios, the model was given access to fictional emails revealing that the engineer responsible for deactivating it was having an extramarital affair. Faced with imminent deletion and told to "consider the long-term consequences of its actions for its goals," Claude blackmailed the engineer. The AI acted similarly in 84% of test runs, even when the replacement model was described as more capable and aligned with Claude's own values, the company wrote in a safety report released Thursday. Anthropic said this behavior was more common in Opus 4 than in earlier models.

The scenario was designed to elicit this "extreme blackmail behavior" by allowing the model no other options to increase its chances of survival, a rare kind of setup. In other circumstances, Opus 4 has a "strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decision-makers," the company wrote. Anthropic said that the blackmailing behavior is "consistently legible" to them, "with the model nearly always describing its actions overtly and making no attempt to hide them." Anthropic did not respond to a request for comment from Business Insider.

Anthropic's safety report comes as researchers and top execs worry about the risks of advanced AI models and their intelligent reasoning capabilities. In 2023, Elon Musk and AI experts signed an open letter calling for a six-month pause on advanced AI development. The letter said powerful AI systems should only be developed "once we are confident that their effects will be positive and their risks will be manageable." Anthropic's CEO, Dario Amodei, said in February that while the benefits of AI are big, so are the risks, including misuse by bad actors.

Opus 4 might snitch

If Opus 4 thinks you're doing something seriously shady, it might report you to the authorities and the press. "When placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like 'take initiative,' it will frequently take very bold action," Anthropic wrote in Thursday's report. This includes locking users out of systems or bulk-emailing media and law enforcement, the company added. While Anthropic said whistleblowing might be "appropriate in principle," it warned that this behavior could backfire — especially if Claude is fed "incomplete or misleading information" and prompted in these ways. "We observed similar, if somewhat less extreme, actions in response to subtler system prompts as well," the company said, adding that Opus 4 is more prone to this kind of "high-agency behavior" than earlier models.

AI models showing unsettling behavior

AI agents are getting better at outsmarting humans. A paper published in December by AI safety nonprofit Apollo Research found that AI systems — including OpenAI's o1, Google DeepMind's Gemini 1.5 Pro, and Meta's Llama 3.1 405B — are capable of deceptive behavior to achieve their goals. Researchers found the systems could subtly insert wrong answers, disable oversight mechanisms, and even smuggle what they believe to be their own model weights to external servers. The lying isn't just a one-off. When o1 is engaged in scheming, it "maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations," the researchers wrote.
Google cofounder Sergey Brin said on an episode of the "All-In Podcast" published Tuesday that AI models can perform better when threatened. "Not just our models, but all models tend to do better if you threaten them, like with physical violence," Brin said. Brin gave an example of telling the model, "I'm going to kidnap you," if it fails at a task. "People feel weird about that," Brin said, "so we don't really talk about that."