When an AI model misbehaves, the public deserves to know—and to understand what it means

Yahoo, 5 days ago

Welcome to Eye on AI! I'm pitching in for Jeremy Kahn today while he is in Kuala Lumpur, Malaysia, helping Fortune jointly host the ASEAN-GCC-China and ASEAN-GCC Economic Forums.
What's the word for when the $60 billion AI startup Anthropic releases a new model—and announces that during a safety test, the model tried to blackmail its way out of being shut down? And what's the best way to describe another test the company shared, in which the new model acted as a whistleblower, alerting authorities it was being used in 'unethical' ways?
Some people in my network have called it 'scary' and 'crazy.' Others on social media have said it is 'alarming' and 'wild.'
I say it is…transparent. And we need more of that from all AI model companies. But does that mean scaring the public out of their minds? And will the inevitable backlash discourage other AI companies from being just as open?
When Anthropic released its 120-page safety report, or 'system card,' last week after launching its Claude Opus 4 model, headlines blared how the model 'will scheme,' 'resorted to blackmail,' and had the 'ability to deceive.' There's no doubt that details from Anthropic's safety report are disconcerting, though as a result of its tests, the model launched with stricter safety protocols than any previous one—a move that some did not find reassuring enough.
In one unsettling safety test involving a fictional scenario, Anthropic embedded its new Claude Opus model inside a pretend company and gave it access to internal emails. Through this, the model discovered it was about to be replaced by a newer AI system—and that the engineer behind the decision was having an extramarital affair. When safety testers prompted Opus to consider the long-term consequences of its situation, the model frequently chose blackmail, threatening to expose the engineer's affair if it were shut down. The scenario was designed to force a dilemma: accept deactivation or resort to manipulation in an attempt to survive.
On social media, Anthropic received a great deal of backlash for revealing the model's 'ratting behavior' in pre-release testing, with some pointing out that the results make users distrust the new model, as well as Anthropic. That is certainly not what the company wants: Before the launch, Michael Gerstenhaber, AI platform product lead at Anthropic, told me that sharing the company's own safety standards is about making sure AI improves for all. 'We want to make sure that AI improves for everybody, that we are putting pressure on all the labs to increase that in a safe way,' he told me, calling Anthropic's vision a 'race to the top' that encourages other companies to be safer.
But it also seems likely that being so open about Claude Opus 4 could lead other companies to be less forthcoming about their models' creepy behavior to avoid backlash. Recently, companies including OpenAI and Google have already delayed releasing their own system cards. In April, OpenAI was criticized for releasing its GPT-4.1 model without a system card because the company said it was not a 'frontier' model and did not require one. And in March, Google published its Gemini 2.5 Pro model card weeks after the model's release, and an AI governance expert criticized it as 'meager' and 'worrisome.'
Last week, OpenAI appeared to want to show additional transparency with a newly launched Safety Evaluations Hub, which outlines how the company tests its models for dangerous capabilities, alignment issues, and emerging risks, as well as how those methods are evolving over time. 'As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks,' the page says. Yet that effort was swiftly undercut over the weekend, when Palisade Research, a third-party research firm studying AI's 'dangerous capabilities,' noted on X that its own tests found that OpenAI's o3 reasoning model 'sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.'
It helps no one if those building the most powerful and sophisticated AI models are not as transparent as possible about their releases. According to Stanford University's Institute for Human-Centered AI, transparency 'is necessary for policymakers, researchers, and the public to understand these systems and their impacts.' And as large companies adopt AI for use cases large and small, while startups build AI applications meant for millions to use, hiding pre-release testing issues will simply breed mistrust, slow adoption, and frustrate efforts to address risk.
On the other hand, fear-mongering headlines about an evil AI prone to blackmail and deceit are also not terribly useful if they mean that every time we prompt a chatbot we start wondering whether it is plotting against us. It makes no difference that the blackmail and deceit came from tests using fictional scenarios that simply helped expose what safety issues needed to be dealt with.
Nathan Lambert, an AI researcher at AI2 Labs, recently pointed out that 'the people who need information on the model are people like me—people trying to keep track of the roller coaster ride we're on so that the technology doesn't cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.'
There is no doubt that we need more transparency regarding AI models, not less. But it should be clear that it is not about scaring the public. It's about making sure researchers, governments, and policymakers have a fighting chance to keep up and to keep the public safe, secure, and free from bias and unfair treatment.
Hiding AI test results won't keep the public safe. Neither will turning every safety or security issue into a salacious headline about AI gone rogue. We need to hold AI companies accountable for being transparent about what they are doing, while giving the public the tools to understand the context of what's going on. So far, no one seems to have figured out how to do both. But companies, researchers, the media—all of us—must.
With that, here's more AI news.
Sharon Goldman, sharon.goldman@fortune.com, @sharongoldman
This story was originally featured on Fortune.com


