OpenAI, Google, and Meta Researchers Warn We May Lose the Ability to Track AI Misbehavior

13 hours ago

Over 40 scientists from the world's leading AI institutions, including OpenAI, Google DeepMind, Anthropic, and Meta, have come together to call for more research in a particular type of safety monitoring that allows humans to analyze how AI models 'think.'
The scientists published a research paper on Tuesday that highlighted what is known as chain of thought (CoT) monitoring as a new yet fragile opportunity to boost AI safety. The paper was endorsed by prominent AI figures like OpenAI co-founders John Schulman and Ilya Sutskever as well as Nobel Prize laureate known as the 'Godfather of AI,' Geoffrey Hinton.
In the paper, the scientists explained how modern reasoning models like ChatGPT are trained to 'perform extended reasoning in CoT before taking actions or producing final outputs.' In other words, they 'think out loud' through problems step by step, providing them a form of working memory for solving complex tasks.
'AI systems that 'think' in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave,' the paper's authors wrote.
The researchers argue that CoT monitoring can help researchers detect when models begin to exploit flaws in their training, manipulate data, or fall victim to malicious user manipulation. Any issues that are found can then either be 'blocked, or replaced with safer actions, or reviewed in more depth.'
OpenAI researchers have already used this technique in testing to find cases when AI models have had the phrase 'Let's Hack' in their CoT.
Current AI models perform this thinking in human language, but the researchers warn that this may not always be the case.
As developers rely more on reinforcement learning, which prioritizes correct outputs rather than how they arrived at them, future models may evolve away from using reasoning that humans can't easily understand. Additionally, advanced models might eventually learn to suppress or obscure their reasoning if they detect that it's being monitored.
In response, the researchers are urging AI developers to track and evaluate the CoT monitorability of their models and to treat this as a critical component of overall model safety. They even recommend that it become a key consideration when training and deploying new models.

Hashtags

Try Our AI Features

Explore what Daily8 AI can do for you:

Comments

No comments yet...

TSMC Raises 2025 Outlook in Big Boost for AI Demand Hopes

Yahoo

16 minutes ago

Yahoo

TSMC Raises 2025 Outlook in Big Boost for AI Demand Hopes

(Bloomberg) -- Taiwan Semiconductor Manufacturing Co. raised its outlook for 2025 revenue growth, shoring up investors' confidence in the momentum of the global AI spending spree. The Dutch Intersection Is Coming to Save Your Life Advocates Fear US Agents Are Using 'Wellness Checks' on Children as a Prelude to Arrests LA Homelessness Drops for Second Year Manhattan, Chicago Murder Rates Drop in 2025, Officials Say The world's biggest contract chipmaker on Thursday forecast sales growth of about 30% in US dollar terms this year, up from mid-20% previously. That reinforced expectations that tech firms from Meta Platforms Inc. to Google will keep spending to build the datacenters essential to artificial intelligence development. Nasdaq stock index futures swung to gains. TSMC's move underscores resilient demand for high-end chips from the likes of Nvidia Corp. and Advanced Micro Devices Inc., which is outpacing its production capacity. Chief Executive Officer C.C. Wei affirmed during a shareholder meeting in June that AI orders continue to run hot — seeking to dispel persistent speculation that tech firms may curtail spending. This is 'supporting the AI value chain, and AI optimism still holds,' said Billy Leung, investment strategist at Global X ETFs in Sydney. 'For investors, TSMC results again ease fears of an AI slowdown. Margins hold, demand outlook good, generally reinforces the AI buildout is still well underway.' Investors have piled back into AI-linked companies, shaking off a funk that settled in after China's DeepSeek cast doubt on whether the likes of Inc. needed to spend that much money on data centers. Last week, Nvidia became the first company in history to hit a $4 trillion valuation, underscoring investors' renewed enthusiasm for companies like TSMC that are key to building the infrastructure for AI. TSMC wasn't hiking its outlook on news the US is prepared to grant Nvidia licenses to export its H20 AI chip to China, Wei told reporters. While that resumption in sales was positive for the industry, it was too early to quantify the impact, he added. A day before TSMC's results, chipmaking gear supplier ASML Holding NV triggered anxiety across markets by walking back its own growth forecast for 2026. Geopolitics and the global economy are sources of 'increasing uncertainty,' Chief Executive Officer Christophe Fouquet said. Its shares dropped more than 11%. Wei on Thursday acknowledged the uncertainty stemming from the Trump administration's tariffs-led assault. The appreciating Taiwanese dollar is also suppressing its financials. 'Looking ahead to the second half of the year, we have not seen any change in our customers' behavior so far,' he said in Taipei. 'However, we understand there are uncertainties and risks' related to potential tariffs. TSMC upgraded its forecast after reporting a better-than-expected 61% jump in net income for the June quarter to NT$398.3 billion ($13.5 billion), keeping intact a streak of beating estimates every quarter since 2021. The company previously posted a 39% surge in revenue. Revenue from high-performance computing — which includes chips for servers and datacenters — now accounts for three-fifths of the company's revenue, a major change from when TSMC primarily rode the smartphone market. It remains the main chipmaker to Apple Inc. The company is sticking with plans to spend $38 billion to $42 billion upgrading and expanding capacity this year. TSMC had earlier pledged to spend another $100 billion ramping up manufacturing in Arizona, Japan, Germany and back home in Taiwan. --With assistance from Dasha Afanasieva, Gao Yuan, Winnie Hsu, Vlad Savov and Cindy Wang. (Updates with market action, commentary from the second paragraph.) How Starbucks' CEO Plans to Tame the Rush-Hour Free-for-All Forget DOGE. Musk Is Suddenly All In on AI How Hims Became the King of Knockoff Weight-Loss Drugs The Quest for a Hangover-Free Buzz Thailand's Changing Cannabis Rules Leave Farmers in a Tough Spot ©2025 Bloomberg L.P. Fehler beim Abrufen der Daten Melden Sie sich an, um Ihr Portfolio aufzurufen. Fehler beim Abrufen der Daten Fehler beim Abrufen der Daten Fehler beim Abrufen der Daten Fehler beim Abrufen der Daten

OYO's Ritesh Agarwal Invests in Culture Circle at Over INR 100 Cr Valuation

Entrepreneur

17 minutes ago

Entrepreneur

OYO's Ritesh Agarwal Invests in Culture Circle at Over INR 100 Cr Valuation

With over one million monthly users and more than 4,000 verified sellers, Culture Circle offers authenticated luxury and streetwear fashion. You're reading Entrepreneur India, an international franchise of Entrepreneur Media. Ritesh Agarwal, founder and chief executive of OYO, has invested in Culture Circle, a fashion commerce platform, at a valuation exceeding INR 100 crore. The funds will be used to enhance the company's artificial intelligence capabilities, expand its product categories, and grow its presence in global markets. Culture Circle was co-founded by Devansh Jain Nawal and Ackshay Jain. The startup had previously raised INR 3 crore from Agarwal and Kunal Bahl on the television show Shark Tank India, turning down a larger offer to preserve equity. Jain Nawal, an alumnus of IIM Ahmedabad and former Goldman Sachs employee, leads the company alongside Jain, who has worked at Google and currently convenes the JIIF Gurugram chapter. With over one million monthly users and more than 4,000 verified sellers, Culture Circle offers authenticated luxury and streetwear fashion. The platform uses AI-powered tools to verify products, compare prices in real-time, and connect with trusted global sellers. It currently operates flagship stores in Delhi and Hyderabad and plans to open new outlets in Gurugram, Mumbai, and other cities. "Culture Circle is one of the most exciting youth-first platforms to emerge from India," said Ritesh Agarwal. "Their focus on trust and experience makes them truly stand out." Devansh Jain Nawal stated, "This is more than funding, it's a partnership built on shared values." Ackshay Jain added, "We're building a cultural movement and this round will help us scale SourceX, our AI-powered engine, and enter new categories and markets." Culture Circle aims to position India as a significant force in the global streetwear and luxury landscape while making high-end fashion more accessible to Gen Z consumers.

Top AI Firms Fall Short on Safety, New Studies Find

Yahoo

20 minutes ago

Yahoo

Top AI Firms Fall Short on Safety, New Studies Find

The logos of Google Gemini, ChatGPT, Microsoft Copilot, Claude by Anthropic, Perplexity, and Bing apps are displayed on the screen of a smartphone. Credit - Jaque Silva—NurPhoto/Getty Images The world's leading AI companies have 'unacceptable' levels of risk management, and a 'striking lack of commitment to many areas of safety,' according to two new studies published Thursday. The risks of even today's AI—by the admission of many top companies themselves—could include AI helping bad actors carry out cyberattacks or create bioweapons. Future AI models, top scientists worry, could escape human control altogether. The studies were carried out by the nonprofits SaferAI and the Future of Life Institute (FLI). Each was the second of its kind, in what the groups hope will be a running series that incentivizes top AI companies to improve their practices. 'We want to make it really easy for people to see who is not just talking the talk, but who is also walking the walk,' says Max Tegmark, president of the FLI. Read More: Some Top AI Labs Have 'Very Weak' Risk Management, Study Finds SaferAI assessed top AI companies' risk management protocols (also known as responsible scaling policies) to score each company on its approach to identifying and mitigating AI risks. No AI company scored better than 'weak' in SaferAI's assessment of their risk management maturity. The highest scorer was Anthropic (35%), followed by OpenAI (33%), Meta (22%), and Google DeepMind (20%). Elon Musk's xAI scored 18%. Two companies, Anthropic and Google DeepMind, received lower scores than the first time the study was carried out, in October 2024. The result means that OpenAI has overtaken Google as second place in SaferAI's ratings. Siméon Campos, founder of SaferAI, said Google scored comparatively low despite doing some good safety research, because the company makes few solid commitments in its policies. The company also released a frontier model earlier this year, Gemini 2.5, without sharing safety information—in what Campos called an 'egregious failure.' A spokesperson for Google DeepMind told TIME: 'We are committed to developing AI safely and securely to benefit society. AI safety measures encompass a wide spectrum of potential mitigations. These recent reports don't take into account all of Google DeepMind's AI safety efforts, nor all of the industry benchmarks. Our comprehensive approach to AI safety and security extends well beyond what's captured.' Anthropic's score also declined since SaferAI's last survey in October. This was due in part to changes the company made to its responsible scaling policy days before the release of Claude 4 models, which saw Anthropic remove its commitments to tackle insider threats by the time it released models of that caliber. 'That's very bad process,' Campos says. Anthropic did not immediately respond to a request for comment. The study's authors also said that its methodology had become more detailed since last October, which accounts for some of the differences in scoring. The companies that improved their scores the most were xAI, which scored 18% compared to 0% in October; and Meta, which scored 22% compared to its previous score of 14%. The FLI's study was broader—looking not only at risk management practices, but also companies' approaches to current harms, existential safety, governance, and information sharing. A panel of six independent experts scored each company based on a review of publicly available material such as policies, research papers, and news reports, together with additional nonpublic data that companies were given the opportunity to provide. The highest grade was scored by Anthropic (a C plus). OpenAI scored a C, and Google scored a C minus. (xAI and Meta both scored D.) However, in FLI's scores for each company's approach to 'existential safety,' every company scored D or below. 'They're all saying: we want to build superintelligent machines that can outsmart humans in every which way, and nonetheless, they don't have a plan for how they're going to control this stuff,' Tegmark says. Write to Billy Perrigo at Solve the daily Crossword