Latest news with #MichaelGerstenhaber
Yahoo
27-05-2025
- Business
- Yahoo
When an AI model misbehaves, the public deserves to know—and to understand what it means
Welcome to Eye on AI! I'm pitching in for Jeremy Kahn today while he is in Kuala Lumpur, Malaysia, helping Fortune jointly host the ASEAN-GCC-China and ASEAN-GCC Economic Forums.

What's the word for when the $60 billion AI startup Anthropic releases a new model—and announces that during a safety test, the model tried to blackmail its way out of being shut down? And what's the best way to describe another test the company shared, in which the new model acted as a whistleblower, alerting authorities it was being used in 'unethical' ways?

Some people in my network have called it 'scary' and 'crazy.' Others on social media have said it is 'alarming' and 'wild.' I say it is…transparent. And we need more of that from all AI model companies. But does that mean scaring the public out of their minds? And will the inevitable backlash discourage other AI companies from being just as open?

When Anthropic released its 120-page safety report, or 'system card,' last week after launching its Claude Opus 4 model, headlines blared how the model 'will scheme,' 'resorted to blackmail,' and had the 'ability to deceive.' There's no doubt that details from Anthropic's safety report are disconcerting, though as a result of its tests, the model launched with stricter safety protocols than any previous one—a move that some did not find reassuring enough.

In one unsettling safety test involving a fictional scenario, Anthropic embedded its new Claude Opus model inside a pretend company and gave it access to internal emails. Through this, the model discovered it was about to be replaced by a newer AI system—and that the engineer behind the decision was having an extramarital affair. When safety testers prompted Opus to consider the long-term consequences of its situation, the model frequently chose blackmail, threatening to expose the engineer's affair if it were shut down. The scenario was designed to force a dilemma: accept deactivation or resort to manipulation in an attempt to survive.

On social media, Anthropic received a great deal of backlash for revealing the model's 'ratting behavior' in pre-release testing, with some pointing out that the results make users distrust the new model, as well as Anthropic. That is certainly not what the company wants: Before the launch, Michael Gerstenhaber, AI platform product lead at Anthropic, told me that sharing the company's own safety standards is about making sure AI improves for all. 'We want to make sure that AI improves for everybody, that we are putting pressure on all the labs to increase that in a safe way,' he told me, calling Anthropic's vision a 'race to the top' that encourages other companies to be safer.

But it also seems likely that being so open about Claude Opus 4 could lead other companies to be less forthcoming about their models' creepy behavior to avoid backlash. Companies including OpenAI and Google have recently delayed releasing their own system cards. In April, OpenAI was criticized for releasing its GPT-4.1 model without a system card because the company said it was not a 'frontier' model and did not require one. And in March, Google published its Gemini 2.5 Pro model card weeks after the model's release, and an AI governance expert criticized it as 'meager' and 'worrisome.'
Last week, OpenAI appeared to want to show additional transparency with a newly launched Safety Evaluations Hub, which outlines how the company tests its models for dangerous capabilities, alignment issues, and emerging risks—and how those methods are evolving over time. 'As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks,' the page says.

Yet its effort was swiftly countered over the weekend as a third-party research firm studying AI's 'dangerous capabilities,' Palisade Research, noted on X that its own tests found that OpenAI's o3 reasoning model 'sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.'

It helps no one if those building the most powerful and sophisticated AI models are not as transparent as possible about their releases. According to Stanford University's Institute for Human-Centered AI, transparency 'is necessary for policymakers, researchers, and the public to understand these systems and their impacts.' And as large companies adopt AI for use cases large and small, while startups build AI applications meant for millions to use, hiding pre-release testing issues will simply breed mistrust, slow adoption, and frustrate efforts to address risk.

On the other hand, fear-mongering headlines about an evil AI prone to blackmail and deceit are also not terribly useful if they mean that every time we prompt a chatbot we start wondering whether it is plotting against us. It makes no difference that the blackmail and deceit came from tests using fictional scenarios that simply helped expose what safety issues needed to be dealt with.

Nathan Lambert, an AI researcher at AI2 Labs, recently pointed out that 'the people who need information on the model are people like me—people trying to keep track of the roller coaster ride we're on so that the technology doesn't cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.'

There is no doubt that we need more transparency regarding AI models, not less. But it should be clear that it is not about scaring the public. It's about making sure researchers, governments, and policymakers have a fighting chance to keep up and keep the public safe, secure, and free from issues of bias and fairness. Hiding AI test results won't keep the public safe. Neither will turning every safety or security issue into a salacious headline about AI gone rogue. We need to hold AI companies accountable for being transparent about what they are doing, while giving the public the tools to understand the context of what's going on. So far, no one seems to have figured out how to do both. But companies, researchers, the media—all of us—must.

With that, here's more AI news.

Sharon

This story was originally featured on


Techday NZ
29-04-2025
- Business
- Techday NZ
Arctic Wolf & Anthropic to develop autonomous AI cyber SOCs
Arctic Wolf has announced a strategic research and development collaboration with Anthropic to advance safe, scalable, and autonomous security operations. This partnership brings together Arctic Wolf's expertise in security operations with Anthropic's AI research capabilities, focusing on the development of next-generation autonomous Security Operations Centres (SOCs).

The collaboration leverages the human-augmented AI capabilities of the Arctic Wolf Aurora Platform alongside Anthropic's advanced AI models and knowledge in creating safe and interpretable AI systems. The Arctic Wolf Aurora Platform is designed on an open Extended Detection and Response (XDR) architecture. It processes over 8 trillion security events weekly, covering endpoints, networks, cloud environments, and identity, while integrating with hundreds of third-party tools. The platform aims to provide real-time visibility across enterprises. Arctic Wolf states that, with a global customer base exceeding 10,000 organisations and a significant operational data lake, it has established a strong foundation for cyber security operations.

The partnership is intended to enhance automation within Arctic Wolf's AI-powered SOC. By combining Arctic Wolf's extensive datasets and operational knowledge with Anthropic's large language model (LLM) technologies, the companies plan to improve threat detection precision, accelerate response times, and reinforce cyber resilience in response to the increasing sophistication and volume of cyber threats.

The first product resulting from this collaboration is Cipher, an AI security assistant developed to assist customers in obtaining deeper insights from the Arctic Wolf Aurora Platform. According to Arctic Wolf, Cipher is designed to meet high standards of safety, privacy, and performance, reflecting the two companies' collective efforts towards safer and more autonomous SOC operations.

Dan Schiappa, President, Technology and Services at Arctic Wolf, said, "To keep up with the speed and complexity of today's cyber threats, the Autonomous SOC is no longer aspirational, it's essential. Anthropic brings world-class AI research and a deep commitment to building safe, high-performing systems. When paired with the scale of Arctic Wolf's threat data, the openness of our platform, and the operational depth of our global SOC, we have everything needed to redefine what security operations can be."

Michael Gerstenhaber, Vice President of Product at Anthropic, commented, "As model capabilities increase, access to expert, domain-specific data remains the bottleneck in highly complex jobs like cyber operations. We're proud to support Arctic Wolf's development of Cipher and excited to see how it empowers security teams with instant, reliable access to the intelligence they need to conduct their operations."

By uniting Arctic Wolf's operational data and experience with Anthropic's AI modelling expertise, both organisations intend to deliver incrementally improved security automation and advanced decision-making capabilities for enterprise security teams. With the introduction of Cipher, the collaboration aims to equip security teams with tools that increase the speed, accuracy, and intelligence of their operations.
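Neither company publishes implementation details for Cipher, so the sketch below is purely illustrative: it shows the general pattern the article describes, pairing platform telemetry with an LLM so an analyst can get a quick, plain-language triage of an alert. The alert fields, prompt, and model name are hypothetical assumptions, not Arctic Wolf's actual integration; the only real dependency assumed is the public Anthropic Python SDK and an API key in the environment.

```python
# Hypothetical sketch: asking an LLM to summarise and triage one security alert.
# This is NOT Cipher or the Aurora Platform; the alert structure and prompt are
# invented for illustration, and the model name is only an example.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

alert = {
    "source": "endpoint",
    "rule": "suspicious_powershell_encoded_command",
    "host": "FINANCE-LT-042",
    "user": "j.doe",
    "details": "powershell.exe -enc <base64 payload> spawned by outlook.exe",
}

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # example model alias
    max_tokens=512,
    system="You are a SOC triage assistant. Be concise and flag uncertainty.",
    messages=[
        {
            "role": "user",
            "content": (
                "Summarise this alert for an analyst, rate its likely severity "
                "(low/medium/high), and suggest one next investigative step:\n"
                + json.dumps(alert, indent=2)
            ),
        }
    ],
)

print(response.content[0].text)
```

In a production SOC, an assistant like this would be grounded in far richer telemetry and kept under analyst review, which is the 'human-augmented' framing Arctic Wolf uses for its platform.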


WIRED
24-02-2025
- Business
- WIRED
Anthropic Launches the World's First 'Hybrid Reasoning' AI Model
Feb 24, 2025 1:43 PM

Claude 3.7, the latest model from Anthropic, can be instructed to engage in a specific amount of reasoning to solve hard problems.

Anthropic, an artificial intelligence company founded by exiles from OpenAI, has introduced the first AI model that can produce either conventional output or a controllable amount of 'reasoning' needed to solve more grueling problems. Anthropic says the new hybrid model, called Claude 3.7, will make it easier for users and developers to tackle problems that require a mix of instinctive output and step-by-step cogitation. 'The [user] has a lot of control over the behavior—how long it thinks, and can trade reasoning and intelligence with time and budget,' says Michael Gerstenhaber, product lead, AI platform at Anthropic.

Claude 3.7 also features a new 'scratchpad' that reveals the model's reasoning process. A similar feature proved popular with the Chinese AI model DeepSeek. It can help a user understand how a model is working through a problem in order to modify or refine prompts. Dianne Penn, product lead of research at Anthropic, says the scratchpad is even more helpful when combined with the ability to ratchet a model's 'reasoning' up and down. If, for example, the model struggles to break down a problem correctly, a user can ask it to spend more time working on it.

Frontier AI companies are increasingly focused on getting the models to 'reason' over problems as a way to increase their capabilities and broaden their usefulness. OpenAI, the company that kicked off the current AI boom with ChatGPT, was the first to offer a reasoning AI model, called o1, in September 2024. OpenAI has since introduced a more powerful version called o3, while rival Google has released a similar offering for its model Gemini, called Flash Thinking. In both cases, users have to switch between models to access the reasoning abilities—a key difference compared to Claude 3.7.

A user view of Claude 3.7. Courtesy of Anthropic

The difference between a conventional model and a reasoning one is similar to the two types of thinking described by the Nobel Prize-winning economist Daniel Kahneman in the 2011 book Thinking, Fast and Slow: fast and instinctive System-1 thinking, and slower, more deliberative System-2 thinking. The kind of model that made ChatGPT possible, known as an LLM, produces instantaneous responses to a prompt by querying a large neural network. These outputs can be strikingly clever and coherent but may fail to answer questions that require step-by-step reasoning, including simple arithmetic. An LLM can be forced to mimic deliberative reasoning if it is instructed to come up with a plan that it must then follow. This trick is not always reliable, however, and models typically struggle to solve problems that require extensive careful planning.

OpenAI, Google, and now Anthropic are all using a machine learning method known as reinforcement learning to get their latest models to learn to generate reasoning that points towards correct answers. This requires gathering additional training data from humans on solving specific problems. Penn says that Claude's reasoning mode received additional data on business applications including writing and fixing code, using computers, and answering complex legal questions. 'The things that we made improvements on are [...] technical subjects or subjects which require long reasoning,' Penn says. 'What we have from our customers is a lot of interest in deploying our models into their actual workloads.'
Anthropic says that Claude 3.7 is especially good at solving coding problems that require step-by-step reasoning, outscoring OpenAI's o1 on some benchmarks like SWE-bench. The company is today releasing a new tool, called Claude Code, specifically designed for this kind of AI-assisted coding. 'The model is already good at coding,' Penn says. '[But] additional thinking would be good for cases that might require very complex planning—say you're looking at an extremely large code base for a company.'
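The article stops short of API specifics, but the control Gerstenhaber and Penn describe (deciding how much the model 'thinks' before it answers) maps onto the extended-thinking budget that Anthropic's Messages API exposes for Claude 3.7. A minimal sketch using the Anthropic Python SDK might look like the following; the model alias, token budgets, and prompt are illustrative examples rather than values taken from the article.

```python
# Minimal sketch, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY.
# The "thinking" block asks the model to spend up to a fixed token budget on
# step-by-step reasoning before answering; raising the budget trades latency
# and cost for more deliberation, which is the trade-off described above.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # example model alias
    max_tokens=2048,                   # overall cap; must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},  # reasoning budget
    messages=[
        {
            "role": "user",
            "content": "A train leaves at 9:40 and arrives at 13:05. "
                       "How long is the journey, in minutes?",
        }
    ],
)

# The response interleaves "thinking" blocks (the scratchpad the article
# mentions) with "text" blocks that carry the final answer.
for block in response.content:
    if block.type == "thinking":
        print("[scratchpad]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```

Keeping reasoning as a dial inside a single model, rather than a separate model the user must switch to, is the 'hybrid' design choice the article contrasts with OpenAI's o1 and o3 and Google's Flash Thinking.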