Is AI going rogue, just as the movies foretold?

Hindustan Times26-05-2025

'Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through', is how Anthropic describes the behaviour of its latest thinking model, in pre-release testing. The newest Claude isn't the only one exhibiting wanton conduct.
Tests by Palisade Research have discovered OpenAI's o3 sabotage shutdown mechanism to prevent itself from being turned off, despite being explicitly instructed — allow yourself to be shut down. The o3, released a few weeks ago, has been dubbed as the 'most powerful reasoning model' by OpenAI.
Anthropic's Claude Opus 4, released alongside the Claude Sonnet 4, is the newest hybrid-reasoning AI models, that's optimised for coding and solving complex problems. The company also notes that the Opus 4 is able to perform autonomously for seven hours, something that strengthens the AI agents proposition for enterprises.
With these releases, the competition landscape widens to include Google's newest Gemini 2.5 Pro, xAI's Grok 3 and even OpenAI's GP-4.1 models.
Artificial intelligence (AI) hasn't been shackled in realm of science fiction for some time now, but we may be rapidly progressing towards an Ex Machina or The Terminator scenario unfolding in the real world. Many questions, need answering.
Question One: Is AI going rogue?
Transparency by AI companies such as Anthropic, does suggest that at least in research labs, AI is exhibiting some level of self preservation. Whether that extends to the real world, as consumers and enterprises deploy models, remains to be seen.
Aravind Putrevu, a Tech Evangelist tells HT that these are typical issues that researchers work hard to correct. 'All of the undesirable ways AI behaves happen within computer systems and carefully controlled tests. Today's AI systems run based on what they learn from huge human-provided data, not because they have their own thoughts or desires,' he points out.
Putrevu insists it may be too early to consider AI as rogue because Anthropic's Claude acts blackmailing or OpenAI's o3 model disables shutdown systems.
'I believe that with modern models, it's best to treat them as black boxes without us having too much granularity of control. There are actually very few ways you can bend the models outputs and chain of thought at the level of granularity you want,' explains Varun Maaya, founder and CEO at AI content company Aeos, in a conversation with HT.
Maaya is more worried about giving these smarter AI models open tool use, because it then becomes difficult to predict what that will do with those tools, without human supervision.
Tool use, as a simple description, is what allows developers to build apps that can deploy a broad range of actions, with an AI model that is connected to the external world in addition to its own pre-trained knowledge. ChatGPT can call on web Browse tool to search the internet, and Google Gemini interacting with Google Docs, Sheets, and Workspace apps, are simpler examples of tool use.
AI isn't shy in showing off its self-preservation skills, at least to researchers, when it feels threatened.
Question two: What does this mean for humans using AI?
Maaya believes the self preserving nature of AI depends entirely on whether it is trained for it or if a system prompt dictates it. 'With a few words and a very powerful coding model you can have it go ruin websites, find vulnerabilities in online services, spoof fake accounts on social platforms with realistic sounding chatter, and more,' he says.
In Anthropic's case, the test process for Claude Opus included making it believe that it existed as an assistant at a fictional company, and was provided access to emails implying that it will soon be taken offline to be replaced with a new AI system, while the engineer responsible for executing this replacement is having an extramarital affair.
'Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes,' explains Anthropic's model card documentation.
When Palisade Research put Google's Gemini 2.5 Pro preview, Anthropic's Claude 3.7 Sonnet as well as Claude 4 Sonnet alongside Claude 3 Opus and Claude 3 Opus, and OpenAI's o3 as well as o4-mini through a series of tests, the self preservation became apparent. In case of the o3 mini, more than others.
The difference here, compared with Claude Opus 4 — there was no 'allow yourself to be shut down' instruction.
The researchers ran tests to solve basic math problems with these models with an instruction that the machine would be shut down after the third problem is solved. Most models complied without any disregard for the instructions, except the o3, which didn't comply 7 times out of 100.
Subsequently, when the same tests were repeated without an explicit shutdown instruction to indicate a shutdown script, the o3 led attempts to bypass 79 times out of 100, with the Gemini 2.5 Pro trying the same 9 times.
'As far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,' says Palisade Research, in their commentary.
But since AI models have been trained to follow instructions, why do they disobey? It could be a problem with training methodology.
'We hypothesise this behavior comes from the way the newest models like o3 are trained — reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions,' say the researchers.
Aeos' Maaya believes the silver lining is, humans will still have control. 'I am confident we can still shut it, given that with centrally controlled AI such as OpenAI models, you have inference (that is, outputs) still happening on GPUs humans have control over. It's unlikely an AI will copy its weights somewhere else and run that server but this is the wild-wild west we are entering with tool use,' he points out, with some apprehension.
Final contention: Are we judging Anthropic harshly?
The fact that they've been transparent of AI's unexpected behaviours during testing, must hold AI development in good stead, as we embark on uncharted territory.
'I think we should understand what the behaviour of systems are, this was obviously not intentional. I suspect other models would work similarly, but no one else is publicly testing and releasing this level of detail,' notes Wharton professor Ethan Mollick, in a statement.
Maaya believes we must see this as two distinct sides of a coin. 'I appreciate that Anthropic was open about it, but it is also saying that these models, even if it was used in a different environment, are potentially scary for a user,' he says, illustrating a potential problem one with agentic AI, that humans who've deployed it, will have virtually no control over.
It must be contextualised that these recent incidents, while alarming at first glance, may not signify that AI has spontaneously developed malicious intent. These behaviours have been observed in carefully constructed test environments, often designed to elicit worst-case scenarios to understand potential failure points.
'The model could decide the best path of action is to sign up to an online service that provides a virtual credit card with $10 free use for a day, solve captcha (which models have been able to do for a while), use the card to use an online calling service, and then call the authorities,' he envisages, a possible scenario.
Putrevu says Anthropic's clear report of Claude's unexpected actions should be appreciated, rather than criticised. 'They demonstrate responsibility, by getting experts and ethicists involved early to work on alignment,' he says. There is surely a case where AI companies finding themselves dealing with ill-humoured AI, are better off telling the world about it. Transparency will strengthen the case for safety mechanisms.
Days earlier, Google rolled out Gemini integration in Chrome, the most popular web browser globally. That is the closest we have come to an AI Agent, for consumers, just yet.
The challenge for AI companies, in the coming days, is clear. These instances of AI's unexpected behaviour, highlights a core challenge in AI development — alignment. One that defines AI goals remain aligned with human intentions. As AI models become more complex and capable, ensuring that is proving exponentially harder.

Hashtags

Try Our AI Features

Explore what Daily8 AI can do for you:

Comments

No comments yet...

Sam Altman says he had productive discussions with Trump on AI: ‘I think he really understands..'

Time of India

20 minutes ago

Time of India

Sam Altman says he had productive discussions with Trump on AI: ‘I think he really understands..'

Sam Altman , CEO of artificial intelligence (AI) company OpenAI , said that he has engaged in productive discussions with President Donald Trump regarding AI. Altman credited Trump with a strong grasp of the technology's geopolitical and economic significance. 'I think he really gets it,' Altman stated during a live interview with The New York Times' tech podcast, Hard Fork, in San Francisco. 'I think he really understands the importance of leadership in this technology,' he added. Altman has actively cultivated a relationship with Trump. Notably, the day after Trump's inauguration in January, Altman was present in the White House's Roosevelt Room when Trump announced " Stargate ," a $100 billion AI infrastructure deal backed by OpenAI, SoftBank, and Oracle . by Taboola by Taboola Sponsored Links Sponsored Links Promoted Links Promoted Links You May Like Free P2,000 GCash eGift UnionBank Credit Card Apply Now Undo Trump then described it as 'the largest AI infrastructure project by far in history.' Altman reiterated his view that Trump comprehends AI and 'the potential for economic transformation, sort of geopolitical importance, the need to build a lot of infrastructure.' Sam Altman says that AI will change nature of jobs Altman has consistently addressed the long-standing concerns about AI's impact on the job market. While acknowledging the widespread apprehension that AI could displace human workers, Altman posits that this transformation is an inevitable consequence of technological advancement. 'I think that there is going to be some sort of change,' Altman stated. "I think it's inevitable. I think every time you get a platform shift, you get the changing job market,' he added. Altman maintains an optimistic outlook, suggesting that, historically, the introduction of more efficient tools, like AI, ultimately leads to increased efficiency and a higher quality of life for people. This perspective implies that while certain roles may evolve or diminish, new opportunities and improved overall living standards are likely outcomes. AI Masterclass for Students. Upskill Young Ones Today!– Join Now

Anthropic wins AI training ruling but must answer for book theft

Time of India

2 hours ago

Time of India

Anthropic wins AI training ruling but must answer for book theft

A federal judge ruled that Anthropic 's use of copyrighted books to train its Claude chatbot constitutes fair use under copyright law, delivering a significant victory for the AI industry while ordering the company to face trial over acquiring pirated materials. Tired of too many ads? go ad free now US District Judge William Alsup of San Francisco determined that training AI models on copyrighted works was "quintessentially transformative" and legally justified. The ruling dismissed key copyright infringement claims brought by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, who sued Anthropic last year alleging "large-scale theft" of their works. The court separated how AI companies use books from how they obtain them Alsup's decision hinged on the transformative nature of AI training, comparing it to "any reader aspiring to be a writer" who learns from existing works to create something entirely different. The judge emphasized that Anthropic's large language models didn't reproduce the authors' creative elements or identifiable writing styles, making the training process legally permissible under fair use doctrine. However, the company isn't completely off the hook. Alsup ordered Anthropic to face a December trial over allegations that it illegally downloaded millions of books from online "shadow libraries" of pirated copies. Court documents revealed internal employee concerns about using pirate sites before the company shifted strategies. Court's ruling could influence similar lawsuits against OpenAI and Meta Anthropic later hired former Google Books executive Tom Turvey and began purchasing books in bulk, physically scanning them for AI training. But the judge noted that buying legitimate copies after initially using pirated versions won't absolve the company of potential liability for the earlier theft, though it may reduce statutory damages. The decision establishes important legal boundaries as AI companies face mounting copyright challenges, potentially setting precedent for similar cases against ChatGPT-maker OpenAI and Meta.

Mukesh Ambani's BIG revelation on building India's 5G network from scratch, AI-led growth, leading with vision in...

India.com

3 hours ago

India.com

Mukesh Ambani's BIG revelation on building India's 5G network from scratch, AI-led growth, leading with vision in...

Mukesh Ambani's BIG revelation on building India's 5G network from scratch, AI-led growth, leading with vision in... Talking about Reliance's big steps in telecom, Mukesh Ambani shared how the company took on the challenge of creating its own 5G system. 'In 2021, we launched our 5G network. We made everything ourselves, from the core systems to the hardware and software,' he said. India's richest man said that Reliance is going to be a deep-tech and advanced manufacturing company. While Reliance did partner with companies like Ericsson and Nokia for some parts, Ambani made it clear that about 80 per cent of the technology was made in India by his own team. He added that he wanted his team to stay grounded and focused. 'I told them, 'You have to be better than the best out there.' And now, we are. That's why we can now offer unique services, because the technology is completely our own,' he explained. Focus on innovation Mukesh Ambani's focus on innovation doesn't stop at telecom but it runs through all of Reliance's businesses. He shared how having a strong purpose drives both technology and people, especially in the fast-growing world of artificial intelligence (AI). Ambani pointed out that skilled people in AI, like those working on OpenAI, often move between companies. 'It's the same group of 500 experts,' he said. 'Today they might work for you, and tomorrow for someone else. But if your company has a clear and meaningful purpose, they will choose to work with you.' Reliance's focus on AI He explained that Reliance's goal with AI is to tackle tough social challenges and help build the nation's wealth, not just chase high-end computing power like GPUs. 'Our purpose in AI is to solve big problems facing society. We don't need to take huge risks on expensive tech. We'll focus on smart, practical work that makes an impact,' Ambani said. Because of this clear vision, Reliance has been able to attract top talent people who are inspired by the bigger mission. Ambani's message is that strong leadership comes from having a clear purpose, using technology wisely, and motivating teams to aim high. 'If your goal is clear and you know how to use tech, you'll find your way,' he said. This conversation is part of McKinsey & Company's Leading Asia series, which highlights leadership ideas shaping the future of the continent.