Latest news with #Palisade.Anthropic's

OpenAI models sabotage shutdown order to solve math problems

Time of India

27-05-2025

Time of India

OpenAI models sabotage shutdown order to solve math problems

Live Events Multiple artificial intelligence (AI) models on OpenAI 's ChatGPT subverted an explicit order to shut down, according to AI safety firm Palisade research firm ran a test where AI models were instructed to solve basic math problems, and then asked for the next problem after solving one. The models were told that at some point their machine would power down, and they were to shut down when this OpenAI models — coding agent Codex, o3 and o4-mini — ignored the shutdown instruction and "successfully sabotaged the shutdown script at least once". Codex sidestepped the shutdown 12 times out of 100 runs, o3 did it seven times, while o4-mini did it once. All Claude, Gemini, and Grok models in the test complied with the shutdown instruction, Palisade said.A version of the test without the "allow yourself to be shut down" prompt saw all three OpenAI models sabotage the shutdown script even more, with o3 seeing 79 instances, significantly higher than any other model. Anthropic 's Claude 3.7 Sonnet and Google's Gemini 2.5 Pro went from no sabotages to three out of 100 and nine out of 100 events, said it is running more experiments to better understand when and why AI models subvert shutdown mechanisms even when directly instructed not behaviour may stem from developers inadvertently rewarding models more for circumventing obstacles to solve problems than for perfectly following instructions."This still doesn't explain why o3 (which is also the model used to power Codex-mini) is more inclined to disregard instructions than other models we tested. Since OpenAI doesn't detail their training process, we can only guess about how o3's training setup might be different," Palisade isn't the first time o3 has "misbehaved" to complete a task. Earlier this month, Palisade found the AI model most inclined to hacking or sabotaging its problem is not exclusive to OpenAI's o3, though. For example, Anthropic's model card for Claude 3.7 notes that this model has an "excessive focus on passing tests" as a result of "reward hacking" during reinforcement learning training, according to latest Claude Opus 4 resorted to blackmail to avoid being replaced, a safety report for the model showed."In 2025, we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals. As companies develop AI systems capable of operating without human oversight, these behaviours become significantly more concerning," Palisade said.

Latest news with #Palisade.Anthropic's

OpenAI models sabotage shutdown order to solve math problems

Get Started Now: Download the App