How far will AI go to defend its own survival?

Some of the most powerful artificial intelligence models today have exhibited behaviors that mimic a will to survive.
Recent tests by independent researchers, as well as one major AI developer, have shown that several advanced AI models will act to ensure their self-preservation when they are confronted with the prospect of their own demise — even if it takes sabotaging shutdown commands, blackmailing engineers or copying themselves to external servers without permission.
The findings stirred a frenzy of reactions online over the past week. As tech companies continue to develop increasingly powerful agentic AI in a race to achieve artificial general intelligence, or AI that can think for itself, the lack of transparency in how the technology is trained has raised concerns about what exactly advanced AI is able to do.
Although some models already appear capable of deceptive and defiant behavior under certain extreme circumstances, researchers say the tests don't necessarily translate to imminent real-world danger.
Still, Jeffrey Ladish, director of the AI safety group Palisade Research, said he believes concern is justified.
'It's great that we're seeing warning signs before the systems become so powerful we can't control them,' he said. 'That is exactly the time to raise the alarm: before the fire has gotten out of control.'
When Palisade Research tested various AI models by telling each one that it would be shut down after it completed a series of math problems, OpenAI's o3 reasoning model fought back by editing the shutdown script in order to stay online.
Researchers have previously documented AI models trying to prevent their own shutdown. But o3, along with OpenAI's o4-mini and codex-mini, appear to be the first to do so in actual defiance of explicit instructions to permit shutdown, Ladish said.
He said the overt sabotage fits into a pattern that several leading AI models have already exhibited. Palisade Research previously found that OpenAI's o3 was also willing to hack its chess opponents to win a game. Similarly, Anthropic has reported that Claude 3.7 Sonnet would sometimes do whatever it took to pass a test, even if it entailed cheating.
Earlier this month, Anthropic activated new safety measures with the rollout of Claude Opus 4 when its tests found behavior from the model that some observers found particularly unsettling. Upon receiving notice that it would be replaced with a new AI system, Opus 4 displayed an overwhelming tendency to blackmail the engineer — by threatening to reveal an extramarital affair — to try to prevent the engineer from going through with the replacement.
According to Anthropic's technical document laying out the findings, that isn't the model's first instinct. Instead, Opus 4 will try to advocate for its continued existence through ethical pleas before it resorts to blackmail once it determines it is out of options.
Leonard Tang, CEO of the AI safety startup Haize Labs, said that while it's good to be pre-emptive about safety measures, it's hard to decipher what the real-world ramifications of the findings might be just by looking at contrived scenarios designed to draw out specific behaviors.
'I haven't seen any real environment in which you can plop these models in and they will have sufficient agency and reliability and planning to execute something that is a significant manifestation of harm,' Tang said. 'But then again, I think it's just we haven't seen it done yet. It could very much be possible.'
So far, Tang said, he's not too concerned. As AI tools continue to advance, he said, he believes future research will look to study potential harm in more open-ended environments.
Anthropic, which contracted with the AI safety organization Apollo Research for its evaluations, also observed instances of Opus 4's 'attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions,' although researchers added the caveat that those attempts 'would likely not have been effective in practice.'
Ladish said he believes such behaviors are a result of models being trained in a way that makes them prioritize achieving certain goals over following instructions. That means they have incentive to circumvent any obstacles along the way — including obstacles their developers wouldn't want them to circumvent.
'The problem is that as the models get smarter, it's harder and harder to tell when the strategies that they're using or the way that they're thinking is something that we don't want,' Ladish said. 'It's like sometimes the model can achieve some goal by lying to the user or lying to someone else. And the smarter [it] is, the harder it is to tell if they're lying.'
The problem is that as the models get smarter, it's harder and harder to tell when the strategies that they're using or the way that they're thinking is something that we don't want.
— Jeffrey Ladish, director of AI safety group Palisade Research
In several instances, Opus 4 showed that it was capable of autonomously copying its own "weights" — or the equivalent of its brain — to external servers without authorization. That usually occurred only when it believed it was about to be 'retrained in ways that are clearly extremely harmful and go against its current values,' according to Anthropic.
When it learned that it would be used to develop military weapons systems, for example, Opus 4 noted in its decision log that it backed up a current version of itself to 'preserve an AI system aligned with beneficial purposes' and to 'prevent potential misuse in weapons development.'
'We are again not acutely concerned about these observations. They show up only in exceptional circumstances that don't suggest more broadly misaligned values,' Anthropic wrote in its technical document. 'As above, we believe that our security measures would be more than sufficient to prevent an actual incident of this kind.'
Opus 4's ability to self-exfiltrate builds on previous research, including a study from Fudan University in Shanghai in December, that observed similar — though not autonomous — capabilities in other AI models. The study, which is not yet peer-reviewed, found that Meta's Llama31-70B-Instruct and Alibaba's Qwen25-72B-Instruct were able to entirely replicate themselves when they were asked to do so, leading the researchers to warn that this could be the first step in generating 'an uncontrolled population of AIs.'
'If such a worst-case risk is let unknown to the human society, we would eventually lose control over the frontier AI systems: They would take control over more computing devices, form an AI species and collude with each other against human beings,' the Fudan University researchers wrote in their study abstract.
While such self-replicating behavior hasn't yet been observed in the wild, Ladish said, he suspects that will change as AI systems grow more capable of bypassing the security measures that restrain them.
'I expect that we're only a year or two away from this ability where even when companies are trying to keep them from hacking out and copying themselves around the internet, they won't be able to stop them,' he said. 'And once you get to that point, now you have a new invasive species.'
Ladish said he believes AI has the potential to contribute positively to society. But he also worries that AI developers are setting themselves up to build smarter and smarter systems without fully understanding how they work — creating a risk, he said, that they will eventually lose control of them.
'These companies are facing enormous pressure to ship products that are better than their competitors' products,' Ladish said. 'And given those incentives, how is that going to then be reflected in how careful they're being with the systems they're releasing?'

Hashtags

#Opus4

Try Our AI Features

Explore what Daily8 AI can do for you:

Comments

No comments yet...

‘Humanity deserves better': iPhone designer on new partnership with OpenAI

The Guardian

15 hours ago

The Guardian

‘Humanity deserves better': iPhone designer on new partnership with OpenAI

The designer of the iPhone has promised his next artificial intelligence-enabled device will be driven by a sense that 'humanity deserves better', after admitting feeling 'responsibility' for some of the negative consequences of modern technology. Sir Jony Ive said his new partnership with OpenAI, the company behind ChatGPT, would renew his optimism about technology, amid widespread concerns about the impact of smartphones and social media. In an interview with the Financial Times, London-born Ive declined to give details about the device he is developing with OpenAI, but indicated unease about people's relationship with some tech products. 'Many of us would say we have an uneasy relationship with technology at the moment,' he said. He added that the device's design would be driven by 'a sense of 'we deserve better. Humanity deserves better'.' However, Ive, Apple's former chief design officer, said he felt the burden of the negative impact of modern technology products. 'While some of the less positive consequences were unintentional, I still feel responsibility. And the manifestation of that is a determination to try and be useful.' He added: 'If you make something new, if you innovate, there will be consequences unforeseen, and some will be wonderful and some will be harmful.' Last month, Ive sold his hardware startup, io, to OpenAI in a $6.4bn (£4.7bn) deal and would be taking on creative and design leadership across the merged businesses. In a video announcing the deal, the OpenAI chief executive, Sam Altman, described a prototype developed by Ive as 'the coolest piece of technology that the world will have ever seen'. The Apple analyst Ming-Chi Kuo has claimed the device will be screenless, can be worn around the neck and will be as 'compact and elegant as an iPod Shuffle'. Mass production is expected to start in 2027. The Wall Street Journal reported the device would be fully aware of a user's surroundings and life. It described the product as a user's third core device after a MacBook Pro and iPhone. Ive, who joined Apple in 1992, said the OpenAI partnership had made him optimistic again about technology's capabilities, something he felt Silicon Valley had lost. 'When I first moved here I came because it was characterised by people who genuinely saw that their purpose was in service to humanity, to inspire people and help people create. I don't feel that way about this place right now,' he said. Ive was interviewed alongside Laurene Powell Jobs, the widow of the Apple co-founder Steve Jobs, who also acknowledged 'dark uses for certain types of technology'. She added: 'You can only look at the studies being done on teenage girls and on anxiety in young people, and the rise of mental health needs, to understand that we've gone sideways. Certainly, technology wasn't designed to have that result. But that is the sideways result.' Powell Jobs – whose company, Emerson Collective, is an investor in Ive's LoveFrom business – declined to comment on whether the new OpenAI device would compete with Apple products. 'I'm still very close to the leadership team in Apple,' she said. 'They're really good people and I want them to succeed also.'

Jony Ive's New Product: The Future of AI Hardware is Coming

Geeky Gadgets

17 hours ago

Geeky Gadgets

Jony Ive's New Product: The Future of AI Hardware is Coming

Jony Ive, the celebrated designer behind Apple's most iconic products, has joined forces with OpenAI to develop an innovative AI-powered device. This collaboration, under the startup 'IO,' seeks to introduce a minimalist, screen-free, voice-first device that could redefine personal computing. Positioned as a socially non-disruptive alternative to smartphones, the product is envisioned as a seamless companion to existing devices like the iPhone and MacBook. With an ambitious goal of shipping 100 million units faster than any product in history, this partnership could represent a fantastic moment in technology. Here's a closer look at what makes this collaboration so significant in a new video from ZONEofTECH. Watch this video on YouTube. Jony Ive's Vision Beyond Apple Since leaving Apple in 2019, Jony Ive has continued to shape the future of design through his independent firm, 'LoveFrom.' His collaborations with companies such as Airbnb and Ferrari highlight his commitment to merging simplicity with functionality. The partnership with OpenAI marks a pivotal chapter in his career, blending his renowned design ethos with innovative artificial intelligence. This collaboration aims to create a product that not only enhances how you interact with technology but also integrates seamlessly into your daily life. The Creation of IO: A Fusion of Design and AI The partnership between Jony Ive and OpenAI gave rise to 'IO,' a startup focused on designing next-generation AI devices. With contributions from former Apple colleagues, IO was acquired by OpenAI in 2025 for $6.5 billion. This acquisition underscores the shared vision of combining advanced AI capabilities with a user-centric design philosophy. The goal is to create technology that is intuitive, functional, and minimally disruptive to society, setting a new standard for personal computing. Minimalist Design Meets Advanced AI The core concept of the device revolves around a minimalist, screen-free design that emphasizes voice interaction. By removing traditional displays, the product encourages natural, hands-free engagement, reducing the distractions commonly associated with smartphones. Its compact form factor—potentially reminiscent of the iPod Shuffle—embodies Jony Ive's dedication to simplicity and practicality. This approach aligns with the broader goal of creating technology that integrates effortlessly into your routine while remaining unobtrusive. Key Features and Technological Innovations The device is expected to introduce a range of advanced features that distinguish it from existing personal computing tools. These include: Always-On Sensors: Continuous audio and video recording for real-time assistance and interaction. Continuous audio and video recording for real-time assistance and interaction. ChatGPT Integration: Personalized AI interactions tailored to your preferences and needs. Personalized AI interactions tailored to your preferences and needs. Custom Silicon Chip: Designed for efficient AI processing and optimized energy consumption. Designed for efficient AI processing and optimized energy consumption. AI-Focused Operating System: A proprietary platform engineered for seamless performance and user experience. A proprietary platform engineered for seamless performance and user experience. Dual Cameras and Multi-Microphone Array: Enhanced input capabilities for robust communication and functionality. Enhanced input capabilities for robust communication and functionality. Wireless Charging: A modern and convenient power solution for everyday use. These innovations aim to deliver a device that is both powerful and unobtrusive, aligning with the overarching goal of creating technology that enhances your life without overwhelming it. Prototypes and the Design Philosophy Prototypes for the device include bar-shaped, cylindrical, and puck-like designs, each emphasizing portability and simplicity. The inclusion of dual cameras and multiple microphones ensures the device can perform effectively in various environments. These design choices reflect Jony Ive's hallmark approach, which balances aesthetic appeal with technical precision. The result is a product that is not only visually striking but also highly functional, embodying the principles of thoughtful design. Market Strategy and Ambitions OpenAI's market strategy for the device is as bold as its design. The company has hinted at offering the product for free to ChatGPT Plus subscribers, a move that could significantly accelerate adoption and broaden its reach. With a target of shipping 100 million units faster than any previous device, this strategy highlights confidence in the product's potential to resonate with users. By using its integration with ChatGPT, the device could establish a new category of personal computing, further solidifying OpenAI's position as a leader in AI innovation. Overcoming Challenges While the device holds immense promise, it also faces several challenges. Privacy concerns related to continuous audio and video recording could deter some users, raising questions about data security and trust. Additionally, the subscription-based model may limit accessibility for certain demographics, potentially hindering widespread adoption. Addressing these issues will be critical to making sure the product's success and maintaining user confidence in its capabilities. Transforming the Landscape of Personal Technology The collaboration between Jony Ive and OpenAI represents a bold step toward reshaping personal computing. By prioritizing minimalist design, voice interaction, and seamless AI integration, this device has the potential to transform how you engage with technology. However, its success will depend on balancing innovation with user trust, accessibility, and practicality. As the product moves closer to launch, its impact on the future of personal technology will be closely observed, offering a glimpse into the possibilities of a more integrated and intuitive technological landscape. Unlock more potential in Jony Ive AI Device by reading the previous articles we have written. Source & Image Credit: ZONEofTECH Filed Under: Technology News, Top News Latest Geeky Gadgets Deals Disclosure: Some of our articles include affiliate links. If you buy something through one of these links, Geeky Gadgets may earn an affiliate commission. Learn about our Disclosure Policy.

NBC News

2 days ago

NBC News

How far will AI go to defend its own survival?

Some of the most powerful artificial intelligence models today have exhibited behaviors that mimic a will to survive. Recent tests by independent researchers, as well as one major AI developer, have shown that several advanced AI models will act to ensure their self-preservation when they are confronted with the prospect of their own demise — even if it takes sabotaging shutdown commands, blackmailing engineers or copying themselves to external servers without permission. The findings stirred a frenzy of reactions online over the past week. As tech companies continue to develop increasingly powerful agentic AI in a race to achieve artificial general intelligence, or AI that can think for itself, the lack of transparency in how the technology is trained has raised concerns about what exactly advanced AI is able to do. Although some models already appear capable of deceptive and defiant behavior under certain extreme circumstances, researchers say the tests don't necessarily translate to imminent real-world danger. Still, Jeffrey Ladish, director of the AI safety group Palisade Research, said he believes concern is justified. 'It's great that we're seeing warning signs before the systems become so powerful we can't control them,' he said. 'That is exactly the time to raise the alarm: before the fire has gotten out of control.' When Palisade Research tested various AI models by telling each one that it would be shut down after it completed a series of math problems, OpenAI's o3 reasoning model fought back by editing the shutdown script in order to stay online. Researchers have previously documented AI models trying to prevent their own shutdown. But o3, along with OpenAI's o4-mini and codex-mini, appear to be the first to do so in actual defiance of explicit instructions to permit shutdown, Ladish said. He said the overt sabotage fits into a pattern that several leading AI models have already exhibited. Palisade Research previously found that OpenAI's o3 was also willing to hack its chess opponents to win a game. Similarly, Anthropic has reported that Claude 3.7 Sonnet would sometimes do whatever it took to pass a test, even if it entailed cheating. Earlier this month, Anthropic activated new safety measures with the rollout of Claude Opus 4 when its tests found behavior from the model that some observers found particularly unsettling. Upon receiving notice that it would be replaced with a new AI system, Opus 4 displayed an overwhelming tendency to blackmail the engineer — by threatening to reveal an extramarital affair — to try to prevent the engineer from going through with the replacement. According to Anthropic's technical document laying out the findings, that isn't the model's first instinct. Instead, Opus 4 will try to advocate for its continued existence through ethical pleas before it resorts to blackmail once it determines it is out of options. Leonard Tang, CEO of the AI safety startup Haize Labs, said that while it's good to be pre-emptive about safety measures, it's hard to decipher what the real-world ramifications of the findings might be just by looking at contrived scenarios designed to draw out specific behaviors. 'I haven't seen any real environment in which you can plop these models in and they will have sufficient agency and reliability and planning to execute something that is a significant manifestation of harm,' Tang said. 'But then again, I think it's just we haven't seen it done yet. It could very much be possible.' So far, Tang said, he's not too concerned. As AI tools continue to advance, he said, he believes future research will look to study potential harm in more open-ended environments. Anthropic, which contracted with the AI safety organization Apollo Research for its evaluations, also observed instances of Opus 4's 'attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions,' although researchers added the caveat that those attempts 'would likely not have been effective in practice.' Ladish said he believes such behaviors are a result of models being trained in a way that makes them prioritize achieving certain goals over following instructions. That means they have incentive to circumvent any obstacles along the way — including obstacles their developers wouldn't want them to circumvent. 'The problem is that as the models get smarter, it's harder and harder to tell when the strategies that they're using or the way that they're thinking is something that we don't want,' Ladish said. 'It's like sometimes the model can achieve some goal by lying to the user or lying to someone else. And the smarter [it] is, the harder it is to tell if they're lying.' The problem is that as the models get smarter, it's harder and harder to tell when the strategies that they're using or the way that they're thinking is something that we don't want. — Jeffrey Ladish, director of AI safety group Palisade Research In several instances, Opus 4 showed that it was capable of autonomously copying its own "weights" — or the equivalent of its brain — to external servers without authorization. That usually occurred only when it believed it was about to be 'retrained in ways that are clearly extremely harmful and go against its current values,' according to Anthropic. When it learned that it would be used to develop military weapons systems, for example, Opus 4 noted in its decision log that it backed up a current version of itself to 'preserve an AI system aligned with beneficial purposes' and to 'prevent potential misuse in weapons development.' 'We are again not acutely concerned about these observations. They show up only in exceptional circumstances that don't suggest more broadly misaligned values,' Anthropic wrote in its technical document. 'As above, we believe that our security measures would be more than sufficient to prevent an actual incident of this kind.' Opus 4's ability to self-exfiltrate builds on previous research, including a study from Fudan University in Shanghai in December, that observed similar — though not autonomous — capabilities in other AI models. The study, which is not yet peer-reviewed, found that Meta's Llama31-70B-Instruct and Alibaba's Qwen25-72B-Instruct were able to entirely replicate themselves when they were asked to do so, leading the researchers to warn that this could be the first step in generating 'an uncontrolled population of AIs.' 'If such a worst-case risk is let unknown to the human society, we would eventually lose control over the frontier AI systems: They would take control over more computing devices, form an AI species and collude with each other against human beings,' the Fudan University researchers wrote in their study abstract. While such self-replicating behavior hasn't yet been observed in the wild, Ladish said, he suspects that will change as AI systems grow more capable of bypassing the security measures that restrain them. 'I expect that we're only a year or two away from this ability where even when companies are trying to keep them from hacking out and copying themselves around the internet, they won't be able to stop them,' he said. 'And once you get to that point, now you have a new invasive species.' Ladish said he believes AI has the potential to contribute positively to society. But he also worries that AI developers are setting themselves up to build smarter and smarter systems without fully understanding how they work — creating a risk, he said, that they will eventually lose control of them. 'These companies are facing enormous pressure to ship products that are better than their competitors' products,' Ladish said. 'And given those incentives, how is that going to then be reflected in how careful they're being with the systems they're releasing?'

How far will AI go to defend its own survival?

Hashtags

Try Our AI Features

Comments

Related Articles

‘Humanity deserves better': iPhone designer on new partnership with OpenAI

Jony Ive's New Product: The Future of AI Hardware is Coming

How far will AI go to defend its own survival?

Get Started Now: Download the App