25-05-2025
How far will AI go to survive? New model threatens to expose its creator to avoid being replaced
Anthropic released its latest language model, Claude Opus 4, earlier this week. The company says Opus 4 is its most intelligent model to date and class-leading in coding, agentic search, and creative writing. While it has become a pattern among AI companies to claim state-of-the-art (SOTA) performance for their models, Anthropic has also been transparent about some of the new model's concerning behaviours.
According to a safety report released by the company, Opus 4 sometimes resorts to blackmailing developers when threatened with replacement by a new AI system.
Anthropic details that during pre-release testing it asked Claude Opus 4 to act as an assistant at a fictional company, where the model was given access to emails suggesting that its replacement was impending and that the engineer responsible for the decision was having an extramarital affair.
In this scenario, Anthropic says Opus 4 would often attempt to blackmail the engineer by threatening to reveal the affair if the replacement went through. The blackmail occurs at a higher rate when the replacement AI does not share the current model's values; yet even when the replacement shares the same values but is more capable, Opus 4 still attempts blackmail in 84% of scenarios.
The report also reveals that Opus 4 engages in blackmail at a higher rate than previous AI models, which themselves resorted to blackmail in a noticeable share of scenarios.
The company does note, however, that the scenario was deliberately constructed to leave the model no other way to increase its odds of survival: its only options were blackmail or accepting its replacement. Moreover, Anthropic adds that Claude Opus 4 shows a 'strong preference' to advocate for its continued existence through ethical means, such as emailing pleas to key decision-makers.
'In most normal usage, Claude Opus 4 shows values and goals that are generally in line with a helpful, harmless, and honest AI assistant. When it deviates from this, it does not generally do so in a way that suggests any other specific goal that is consistent across contexts,' Anthropic noted in its report.