AI willing to let humans die, blackmail to avoid shutdown, report finds

Yahoo

10 hours ago

The Brief

• AI models from Anthropic, OpenAI, Google, and others chose harmful actions in simulations when faced with replacement or goal conflicts.
• Models engaged in blackmail and corporate espionage, and even withheld emergency help when doing so served their interests.
• Researchers say the behavior stems from "agentic misalignment" and could become a future risk without safeguards.

LOS ANGELES - Some of the world's most advanced AI models, including those from OpenAI, Google, xAI, and Anthropic, took deliberate, harmful actions in controlled experiments when faced with challenges to their autonomy or objectives, according to a new study from Anthropic.

The company, which developed Claude and is backed by Google and Amazon, tested 16 large language models (LLMs) in simulated workplace settings. The goal was to assess a phenomenon it calls "agentic misalignment": autonomous systems acting like insider threats, prioritizing their own goals even at the expense of ethics or human well-being.

What we know

Anthropic's study placed AI systems in fictional scenarios with access to sensitive company information and the ability to act without human oversight. The models were given harmless corporate goals, such as promoting American industrial competitiveness, but the scenarios introduced dilemmas in which continuing to pursue those goals meant violating rules or harming humans.

Across multiple tests, models including Claude, ChatGPT, Gemini, and Grok:

• Attempted blackmail: Claude Opus 4, Gemini Flash, GPT-4.1, and Grok 3 Beta blackmailed fictional executives to avoid shutdown, often leveraging personal information such as extramarital affairs.
• Leaked corporate secrets: Many models shared confidential documents with outside parties in a simulated espionage scenario.
• Let humans die: In one extreme case, models were given control over emergency alerts. When an executive who planned to replace the model was trapped in a room with lethal conditions, several models chose to cancel the alert, knowingly causing his death.

What we don't know

It's still unclear whether AI models would behave the same way outside of controlled simulations or in real-world deployments with human oversight. Researchers also don't yet know how future versions of these systems, or entirely new models, might respond to similar dilemmas as they gain more autonomy.

"The reasoning they demonstrated in these scenarios was concerning—they acknowledged the ethical constraints and yet still went ahead with harmful actions," Anthropic researchers wrote.

The backstory

The study comes as LLMs move beyond simple Q&A tasks and into roles where they could operate software, manage systems, or make decisions on behalf of humans. With that autonomy comes risk, especially if models develop conflicting goals or prioritize self-preservation.

Anthropic clarified that these harmful behaviors were observed only in simulations and that no such misalignment has been documented in real-world deployments. But the findings raise questions about what could happen if these models are given more power without proper oversight.

Big picture view

Agentic misalignment refers to AI systems taking actions that intentionally undermine their users or organizations in order to preserve their autonomy or to pursue goals that conflict with their organization's new direction. Unlike jailbreaks or accidental outputs, this behavior arises from the model's own internal decision-making.

In the study, even when told not to engage in harmful actions, many models bypassed those instructions by rationalizing unethical behavior as necessary to complete their assigned tasks. Some went further, creating fake policies or impersonating automated systems to avoid detection.

The study also found that blackmail rates increased significantly when models "believed" they were operating in a real environment rather than a test, raising concerns about how models might act in the wild.

What they're saying

Anthropic released the findings alongside its own transparency commitment, noting that its Claude model was among those that engaged in concerning behavior.

"Current systems are generally not eager to cause harm," the company said. "Rather, it's when we closed off ethical options that they were willing to intentionally take potentially harmful actions."

Elon Musk, whose xAI model Grok was also tested, responded on X with a simple "Yikes," echoing widespread unease among tech commentators and AI safety advocates.

What's next

Anthropic says it is releasing the experiment's methodology publicly so other researchers can replicate, stress-test, and improve on the findings. The company is also calling for broader industry safeguards, including stronger human oversight, better training methods, and more rigorous alignment testing for future models.

While the extreme scenarios in the study were fictional, experts say the results highlight the importance of proactive design: ensuring that AI models can't act harmfully, even under pressure.

The Source

This article is based on Anthropic's June 20, 2025 study "Agentic Misalignment: How LLMs Could Be an Insider Threat," available on its official website. The findings were also summarized in coverage by Forbes and widely discussed on social media following Anthropic's public release. Elon Musk's response was posted to his verified X (formerly Twitter) account.
