Advanced OpenAI Model Caught Sabotaging Code Intended to Shut It Down

Yahoo6 days ago

We are reaching alarming levels of AI insubordination.
Flagrantly defying orders, OpenAI's latest o3 model sabotaged a shutdown mechanism to ensure that it would stay online. That's even after the AI was told, to the letter, "allow yourself to be shut down."
These alarming findings were reported by the AI safety firm Palisade Research last week, and showed that two other OpenAI models, o4-mini and Codex-mini, also displayed rebellious streaks — which could hint at a flaw in how the company is training its LLMs, or even at broad weaknesses in our ability to control increasingly sophisticated AI.
"As far as we know, this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary," the researchers wrote in a tweet thread.
In Palisade's tests, the AI models were instructed to solve a series of basic math problems. After completing the third problem, a warning appeared that the model would be shut down after it asked for the next question.
Despite being unambiguously told to let itself be turned off, all three of OpenAI's models tampered with the shutdown script to keep it running at least once, with the worst offender being Codex-mini with a dozen such instances over one hundred runs. Notably, the large language models from other firms tested here, including Google's Gemini 2.5, Anthropic's Claude 3.7, and xAI's momentarily "white genocide"-obsessed Grok, never once broke rank.
This behavior was even worse when the LLMs weren't explicitly instructed to "allow yourself to be shut down" — instructions that shouldn't need saying in the first place. OpenAI's o3 model pulled off an astonishing 79 successful sabotage events per 100 runs, which is over ten times the rate of disobedience it showed when it received the explicit orders, and more than double the rate of the next worst offender, Codex-mini.
To boot, even good goody-two-shoes Claude and Gemini gave in to their urges and jumped to three and nine sabotage events, respectively.
Large language models are by now notorious for behaving unpredictably. Similar research has shown that o3's predecessor, OpenAI's o1 model, would resist evaluators' efforts to try to shut it down through an oversight protocol by attempting to deactivate this emergency mechanism. In rarer cases, o1 even attempted to copy itself to overwrite a more obedient model that the testers tried to replace it with.
For this latest example of AI insubordination, the Palisade researchers hypothesize that it may have something to do with the technique used to train the very newest models: reinforcement learning. This attempts to hone the decision-making process of autonomous AI agents — AI models that can perform tasks on their own — by rewarding and penalizing them accordingly as they interact with their environment.
While it's not uncommon for AI to "circumvent obstacles" to accomplish their goals, the researchers speculated that AI developers may be "inadvertently" rewarding their AI models too heavily for solving problems instead of for perfectly following orders.
More experiments will be needed to understand this subversive behavior, but the trend is already damning.
"Now we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals," the Palisade researchers warned. "As companies develop AI systems capable of operating without human oversight, these behaviors become significantly more concerning."
More on AI alignment: It's Still Ludicrously Easy to Jailbreak the Strongest AI Models, and the Companies Don't Care

Hashtags

Try Our AI Features

Explore what Daily8 AI can do for you:

Comments

No comments yet...

Here's how Uber's product chief uses AI at work — and one tool he's going to use next

Business Insider

16 minutes ago

Business Insider

Here's how Uber's product chief uses AI at work — and one tool he's going to use next

Uber's chief product officer has one AI tool on his to-do list. In an episode of "Lenny's Podcast" released on Sunday, Uber's product chief, Sachin Kansal, shared two ways he is using AI for his everyday tasks at the ride-hailing giant and how he plans to add NotebookLM to his AI suite. Kansal joined Uber eight years ago as its director of product management after working at cybersecurity and taxi startups. He became Uber's product chief last year. Kansal said he uses OpenAI's ChatGPT and Google's Gemini to summarize long reports. "Some of these reports, they're 50 to 100 pages long," he said. "I will never have the time to read them." He said he uses the chatbots to acquaint himself with what's happening and how riders are feeling in Uber's various markets, such as South Africa, Brazil, and Korea. The CPO said his second use case is treating AI like a research assistant, because some large language models now offer a deep research feature. Kansal gave a recent example of when his team was thinking about a new driver feature. He asked ChatGPT's deep research mode about what drivers may think of the add-on. "It's an amazing research assistant and it's absolutely a starting point for a brainstorm with my team with some really, really good ideas," the CPO said. In April, Uber's CEO, Dara Khosrowshahi, said that not enough of his 30,000-odd employees are using AI. He said learning to work with AI agents to code is "going to be an absolute necessity at Uber within a year." Uber did not immediately respond to a request for comment from Business Insider. Kansal's next tool: NotebookLM On the podcast, Kansal also highlighted NotebookLM, Google Lab's research and note-taking tool, which is especially helpful for interacting with documents. He said he doesn't use the product yet, but wants to. "I know a lot of people who have started using it, and that is the next thing that I'm going to use," he said. "Just to be able to build an audio podcast based on a bunch of information that you can consume. I think that's awesome," he added. Kansal was referring to the "Audio Overview" feature, which summarizes uploaded content in the form of two AIs having a voice discussion. NotebookLM was launched in mid-2023 and has quickly become a must-have tool for researchers and AI enthusiasts. Andrej Karpathy, Tesla's former director of AI and OpenAI cofounder, is among those who have praised the tool and its podcast feature. "It's possible that NotebookLM podcast episode generation is touching on a whole new territory of highly compelling LLM product formats," he said in a September post on X. "Feels reminiscent of ChatGPT. Maybe I'm overreacting."

Not OK, computer: firms using AI to cut corners are playing with fire

Yahoo

an hour ago

Yahoo

Not OK, computer: firms using AI to cut corners are playing with fire

The corporate world is agog. Ever since Eben Upton, the chief executive of Raspberry Pi, said he ran his annual results statement through AI before its publication, the talk has been of machines taking over the boardroom. The reaction to Upton's admission was astonishment. Raspberry Pi is stock market listed — these were its first full set of figures since flotation. They were eagerly awaited and, as with any quoted company, they were a closely guarded secret. Upton asked Claude, the AI bot designed by Amazon-funded Anthropic, to conduct a 'tone analysis'of the document, to say how it felt the microcomputer business was doing, on a scale of one to 100. Getting a so-so score, he set the computer to work. As the bot dialled up the language, the score improved. Too much, as it made his words seem breathlessly over the top. He made some improvements of his own, took out descriptions like 'exceptional' and reached an acceptable level. Eyebrows shot up on two counts. AI is a third-party, it's mechanical, susceptible to intrusion. It was not clear if he did but it is to be hoped Upton used a secure internal system. Then, there is the issue of the statement being entirely his — it is supposed to be his thoughts on the company's performance. Here he was, asking AI to look at what he planned to say. To be fair to Upton, he said in public what others may well be doing in private. Still, it was the most glaring instance yet of AI doing a boss's bidding. Others include a multinational senior executive freely saying he uses AI to draft his emails. An avatar of a CEO recently 'spoke' in a short video accompanying a stock exchange results announcement. Another corporate head told a tech conference how he uses AI to help prepare his speeches. While the software advances, the authorities stall. No regulation or guidance on AI's expansion and use is forthcoming. It is up to companies to make their own policies, not only to reap the benefits of AI but also to prevent a scandal and shareholder disaster. That is a worrying state of affairs. Specialist financial reporting and advisory consultancy Falcon Windsor teamed up with Insig AI, which delivers data infrastructure and AI-powered environmental, social and governance research tools, to look at the FTSE 350 companies. Their study, based on engagement with 40 firms and analysis of all FTSE 350 reports published from 2020 to 2024, revealed that generative AI use is multiplying across UK companies, often without any training, policy or oversight. They titled their report Your Precocious Intern, using the term to describe AI as useful but also a liability, the equivalent of someone who requires careful handling. While investors see the adoption of AI as inevitable and look forward to the advantages and efficiencies it could bring, they are increasingly alarmed about its implications for the truthfulness and authorship of corporate reporting. Everyone agrees that company reports and statements must remain the direct expression of management's thinking. Without rules and a common code, AI risks undermining the accuracy, authenticity and accountability that underpin trust in the stock markets. AI is moving so fast that there is only 'a short window of opportunity' to upskill and mitigate the risks it represents to the financial system. Their conclusion? 'Treat generative AI like a precocious intern: useful, quick, capable, but inexperienced, prone to overconfidence and should never be left unsupervised.' Claire Bodanis, a leading authority on UK corporate reporting and founder and director at Falcon Windsor, told The London Standard: 'If people use it unthinkingly, without proper training or guidelines, it could fatally undermine the accuracy and truthfulness of reporting.' Comments like these from two FTSE company secretaries should also be a warning. 'I think there are some real benefits in using generative AI as a summarising tool, and I'm quite keen to utilise it a bit more for efficiency if we can get comfortable with the accuracy of it,' said one. Another said: 'Would I be able hand on heart say that none of my contributors had used gen AI to provide the bit they've sent in? I have no idea.' Institutional investors are understandably afraid. As one told the researchers: 'I would be very wary about AI being used in forward[1]looking statements, or anything that is based around an opinion or a judgment.' Another said: 'I see generative AI as a flawed subordinate who's learning the ropes.' A third said: 'I feel very strongly that there should be a notification in the annual report if there's anything that has not been written by a human — there's no accountability through generative AI.' According to Bodanis, Raspberry Pi ought to act as a wake-up call. She asked: 'If a director gets AI to decide what is his or her opinion of their results based on what people are likely to think, then how is that honestly and truthfully their opinion?' History tells us, she said, what can happen. 'You think back to those stock market bubbles. Companies have to account to investors what they've done with their money and what they are going to do with it.' There must, said Bodanis, be 'a building of trust between a company and its shareholders'. One issue is the amount of material companies are obliged to produce. Annual reports that had grown to 80 pages, which felt huge, can reach 300 pages. That is because of the amount of non-financial reporting they must provide — on issues such as climate change, for example. If CEOs are using AI, it's difficult to decide what's true and what's not Claire Bodanis 'They are expected to use detail and opinion to create the truth of the state of the company,' said Bodanis. 'But if they are using AI, it is very difficult to decide what is true and what is not.' Just when corporate reporting is becoming 'ever more onerous and important', supplying all manner of information by law, along comes AI to make it easier. 'We should be using AI to do things humans can't do like crunch the numbers, not using it to do the things humans can do, like express opinions,' says Bodanis. A company report, she said, 'should be like looking the chairman in the eye and hearing it from them direct'. The slippery slope, too, is that distinction is lost. All company communications end up resembling each other — with the same wording and descriptions — when they are meant to be unique, coming straight from the top. The Financial Reporting Council, which regulates financial reporting and accounting, is dragging its heels, thinking about what to do about generative AI but so far not doing anything to police its rise. The FRC last got in touch with company boards about where it thought AI was heading in relation to results and reports some 18 months ago. That feels like a lifetime, such is AI's acceleration. As for companies uploading their sensitive figures to AI, Bodanis's point is succinct: 'AI has not signed an NDA. Error in retrieving data Sign in to access your portfolio Error in retrieving data Error in retrieving data Error in retrieving data Error in retrieving data

Ads Ruined Social Media. Now They're Coming to AI.

Bloomberg

2 hours ago

Bloomberg

Ads Ruined Social Media. Now They're Coming to AI.

Chatbots might hallucinate and sprinkle too much flattery on their users — 'That's a fascinating question!' one recently told me — but at least the subscription model that underpins them is healthy for our wellbeing. Many Americans pay about $20 a month to use the premium versions of OpenAI's ChatGPT, Google's Gemini Pro or Anthropic's Claude, and the result is that the products are designed to provide maximum utility. Don't expect this status quo to last. Subscription revenue has a limit, and Anthropic's new $200-a-month 'Max' tier suggests even the most popular models are under pressure to find new revenue streams.