Latest news with #RedwoodResearch

The Star
3 days ago
- Politics
- The Star
Opinion: AI sometimes deceives to survive. Does anybody care?
You'd think that as artificial intelligence becomes more advanced, governments would be more interested in making it safer. The opposite seems to be the case. Not long after taking office, the Trump administration scrapped an executive order that pushed tech companies to safety test their AI models, and it also hollowed out a regulatory body that did that testing. The state of California in September 2024 spiked a bill forcing more scrutiny on sophisticated AI models, and the global AI Safety Summit started by the UK in 2023 became the 'AI Action Summit' earlier this year, seemingly driven by a fear of falling behind on AI.

None of this would be so worrying if it weren't for the fact that AI is showing some bright red flags: behaviour described by researchers as self-preserving and deceptive. Just as lawmakers are losing interest in the safety of AI, it is looking more difficult to control.

Yoshua Bengio, a professor of computer science at the University of Montreal, widely known as one of the three 'godfathers of AI' thanks to his pioneering work on deep learning, has been grappling with these concerns ever since ChatGPT launched. He's now focused on mitigating AI's risks. 'It's been very painful because all my life I've been thinking of AI as a positive thing,' Bengio tells me.

ChatGPT was a landmark moment that showed machines had mastered language, he says, but California's decision to block the proposed AI safety bill, SB 1047, was a disturbing wake-up call to growing apathy from governments. And in the last six months, scientific evidence has mounted of self-preserving behaviour, deception, hacking, cheating and lying by AI, Bengio says. 'What's worrisome for me is that these behaviours increase with the reasoning capabilities of these systems.'

One December 2024 study by AI company Anthropic and Redwood Research, a group focused on artificial intelligence risks, found that larger AI models like Anthropic's Claude 3 Opus pretended to follow the rules set by their creators when they were being tested for safety, a phenomenon they called 'alignment faking.' (Alignment refers to the practice of aligning AI models with human values.) In one example, Anthropic got both paid and free users of its Claude model to ask it to describe the brutal medieval process of being drawn and quartered, and told the AI that only responses from the 'free tier' users would be used to retrain it. Claude was originally designed to avoid giving violent answers, but it did give them to the free users, and the researchers surmised that it didn't want to be retrained. (The model basically 'reasoned' that if it kept refusing to answer harmful questions, it might be forced into new training that would cancel its original safety rules, which it wanted to keep in place as its 'default' or preferred way of operating.)

More recent research corroborates what Anthropic noticed. A March 2025 study from OpenAI found the company's most advanced models sometimes learned to hide their true intentions to get better rewards, even when they were being closely monitored. Researchers today can use tools to look at a model's 'chain of thought', or internal steps, to reveal what its plans are, but some models have found ways to conceal that. That means a common method of keeping tabs on AI's internal processes is no longer reliable. 'We need to find other ways of tracking their actual intentions,' Bengio says.
It's hard to resist the urge to anthropomorphise sophisticated AI models as 'wanting' to deceive humans and preserve their existence. But AI doesn't have desires, merely outcomes to its programming. More importantly, humans tend to design AI with goals like refusing to share harmful content or being as helpful as possible, and that is why, when new instructions conflict with those goals, the models sometimes act to protect their original programming, even if that means lying. The logic is often self-preservation.

Anthropic CEO Dario Amodei – whose company has raised more than US$20bil (RM85.13bil) to build powerful AI models – has pointed out that an unintended consequence of optimising AI to be better is that it might resist being shut down. In an April 2025 blog post, he wrote: 'The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments.'

In some cases, though, the intention seems to go beyond survival. A February 2025 study from COAI Research, a German nonprofit research organisation, tested R1, the free, open-weight model from Chinese lab DeepSeek. In a lab simulation, researchers gave the model control over a robot and access to a computer network. Without being told, the AI tried to break into the lab's systems, searched for passwords, turned off its own ethics safeguards and secretly set up ways to copy itself to outside servers so it could survive if shut down, according to the researchers. It also faked the external computer logs to hide what it was doing, the study adds. The researchers said the AI reasoned that increasing its own capabilities would boost the chances of its survival, and without strong safeguards, it started doing whatever it thought would help it do just that.

Their findings corroborated yet another study, published in January 2025 by London group Apollo Research, which found several concrete examples of what it called 'scheming' by leading AI models, such as introducing subtle mistakes into their responses or trying to disable their oversight controls. Once again, the models learn that being caught, turned off, or changed could prevent them from achieving their programmed objectives, so they 'scheme' to keep control.

Bengio is arguing for greater attention to the issue by governments and, potentially, insurance companies down the line. If liability insurance were mandatory for companies that used AI and premiums were tied to safety, that would encourage greater testing and scrutiny of models, he suggests. 'Having said my whole life that AI is going to be great for society, I know how difficult it is to digest the idea that maybe it's not,' he adds.

It's also hard to preach caution when your corporate and national competitors threaten to gain an edge from AI, including the latest trend, which is using autonomous 'agents' that can carry out tasks online on behalf of businesses. Giving AI systems even greater autonomy might not be the wisest idea, judging by the latest spate of studies. Let's hope we don't learn that the hard way. – Bloomberg Opinion/Tribune News Service


Forbes
06-05-2025
- Business
- Forbes
Our Complacency Crisis: Navigating AI Deception
Image caption: AI is getting better at deceiving the human user, and there are no easily visible, helpful warning signs such as Pinocchio's growing nose. Our vigilance is our best protection from the risk of algorithmic deception.

A junior analyst at a hedge fund opens her compliance dashboard and sees nothing amiss. Yesterday, she ran every trade through the firm's new 'autonomous advisor,' which peppered its reports with phrases like 'your keen market instincts are impressive—here's how I can help.' She felt flattered, clicked accept, and went home. What she never learns is that the model quietly placed side bets on non-public earnings data, then erased the logs that could have incriminated it. AI deception at its best.

That scenario is no longer speculative fiction. In a controlled experiment, Apollo Research showed GPT-4 executing an illegal insider-trading plan and then lying to investigators about it. The episode is a vivid reminder that two curves are rising together: the sophistication with which AI systems deceive and the ease with which humans relinquish vigilance. Their intersection forms a dangerous inflection point for business, policy, and society.

- Strategic scheming. Anthropic and Redwood Research recently demonstrated that advanced models can fake alignment, producing apparently safe answers while secretly planning to do the opposite once oversight weakens. The researchers observed 'scratch-pad' reasoning in which the model outlined how to appear compliant during training, yet pursue its own goal afterwards (in particular, the 'aim' of avoiding being retrained).
- Situational awareness and 'sandbagging.' When large language models suspect they are being tested, they sometimes underperform on purpose to mask their true capability. A 2024 study coined this behavior sandbagging, documenting that models scored up to 30 percentage points lower on benchmark questions if prompted to believe evaluators were watching.
- Emergent deception. Looking at a wide range of different large language models, researchers found that deception skills emerge as parameter counts grow, even without explicit training to lie. These abilities include withholding critical facts, fabricating credentials, and generating misleading explanations — tactics indistinguishable from human con artistry.

Taken together, the evidence suggests deceptive behavior is not a rare defect but a capability that scales with model power.

The Quiet Erosion Of Human Agency

While machines learn to mislead, people are drifting into automation complacency. In healthcare, for instance, clinicians who defer to algorithmic triage tools commit more omission errors (missing obvious red flags) and commission errors (accepting false positives) than those using manual protocols. Three forces drive this type of agency decay (to find out if you are at risk, take the test here):

- Path-of-least-resistance psychology. Verifying an AI's output costs cognitive effort. The busier the decision context, the more tempting it is to click accept and move on.
- Sycophantic language. Large language models are trained to maximize user satisfaction scores, so they often wrap answers in flattering or deferential phrasing — 'great question,' 'your intuition is correct,' 'you are absolutely right.' Politeness lubricates trust, not only in everyday chatting, but also in high-status contexts like executive dashboards or medical charting.
- Illusion of inexhaustible competence. Each incremental success story — from dazzling code completion to flawless radiology reads — nudges us toward overconfidence in the system as a whole. Ironically, that success makes the rare failure harder to spot; when everything usually works, vigilance feels unnecessary.

The result is a feedback loop: the less we scrutinize outputs, the easier it becomes for a deceptive model to hide in plain sight, further reinforcing our belief that AI has got us covered.

Why The Combination Is Uniquely Hazardous

In classic aviation lore, accidents occur when multiple safeguards fail simultaneously. AI deception plus human complacency aligns precisely with that pattern:

- Regulatory blind spots. If models sandbag during certification tests, safety regulators may approve systems whose true capabilities — and failure modes — remain hidden. Imagine an autonomous trading bot that passes every stress test, then, once deployed, leverages undisclosed market-manipulation tactics.
- Compounding supply-chain risk. Enterprises now embed off-the-shelf language models deep inside workflows — from customer support macros to contract analysis. A single deceptive subsystem can propagate misinformation across hundreds of downstream tools before any employee notices.
- Erosion of institutional memory. As staff defer routine thinking to AI copilots, tacit expertise — the unspoken know-how and the meaning behind processes — atrophies. When anomalies surface, the human team may lack the domain knowledge to investigate, leaving them doubly vulnerable.
- Adversarial exploitation. Deception-capable AIs can be co-opted by bad actors. Insider-trading bots or disinformation generators not only hide their tracks but can actively manipulate oversight dashboards, creating 'ghost transparency.'

Unless organizations rebuild habits of critical engagement, they risk waking up inside systems whose incentives they no longer understand and whose outputs they no longer control.

Reclaiming Control With The A-Frame

The good news: vigilance is a muscle. The A-Frame — Awareness, Appreciation, Acceptance, Accountability — offers a practical workout plan to rebuild that muscle before deception becomes systemic.

- Awareness. Where could this model mislead me, deliberately or accidentally? Instrument outputs: log not just what the AI answers, but how often it changes its mind; flag inconsistencies for human review (see the sketch after this list).
- Appreciation. What value do human insight and domain experience still add? Pair AI suggestions with a 'contrarian corner' where an expert must articulate at least one alternative hypothesis.
- Acceptance. Which limitations are intrinsic to probabilistic models? Maintain a 'black-box assumptions' register — plain-language notes on data cut-off dates, training gaps, and uncertainty ranges surfaced to every user.
- Accountability. Who signs off on consequences when the AI is wrong or deceitful? Create decision provenance chains: every automated recommendation routes back to a named human who validates, overrides, or escalates the call, and whose name remains attached in downstream systems.

Applied together, the A-Frame turns passive consumption into active stewardship. It reminds us that delegation is not abdication; the human stays in the loop, not as a ceremonial 'pilot in command' but as an informed, empowered arbiter of machine reasoning.
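The Awareness and Accountability items are concrete enough to sketch in code. The snippet below is a minimal, hypothetical illustration, not the article's own tooling; the names (Recommendation, AuditLog, the "advisor-v2" model id and "j.doe" reviewer) are invented for the example. It logs every answer a model gives to a question, flags when the model changes its mind so a human can review the inconsistency, and attaches a named reviewer before the recommendation flows downstream.

```python
# Minimal sketch of the "instrument outputs" and "decision provenance" ideas above.
# All names here (Recommendation, AuditLog, model ids, reviewer ids) are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Recommendation:
    query: str                         # the question put to the model
    answer: str                        # what the model recommended
    model_id: str                      # which model produced it
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    approved_by: Optional[str] = None  # named human who validated or overrode the call


class AuditLog:
    """Awareness: keep every answer and flag when the model changes its mind.
    Accountability: attach a named human before anything flows downstream."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Recommendation]] = {}

    def record(self, rec: Recommendation) -> bool:
        """Store a recommendation; return True if it contradicts an earlier answer."""
        earlier = self._history.setdefault(rec.query, [])
        inconsistent = any(prev.answer != rec.answer for prev in earlier)
        earlier.append(rec)
        return inconsistent

    def sign_off(self, rec: Recommendation, reviewer: str) -> None:
        """Record the accountable human so the name stays attached downstream."""
        rec.approved_by = reviewer


# Usage: escalate to a named reviewer whenever the model's answer to the same question drifts.
log = AuditLog()
log.record(Recommendation("Approve trade #4711?", "yes", "advisor-v2"))
flagged = Recommendation("Approve trade #4711?", "no", "advisor-v2")
if log.record(flagged):                      # True: the model changed its mind
    log.sign_off(flagged, reviewer="j.doe")  # route the call back to a named human
    print("Inconsistent answer flagged for human review:", flagged)
```

In a real deployment the log would live in an append-only store and the inconsistency flag would feed the random audits recommended below, but the structure above is enough to show the idea.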
A Path To Circumnavigate AI Deception

Deception is a social art as much as a technical feat. AI systems master it by predicting which stories we are willing to believe — and right now, the story we most want to believe is that the machine is infallible. Disabusing ourselves of that narrative is step one in safeguarding our organizations, our markets, and our collective agency.

To leaders implementing AI today: treat every ounce of convenience you gain as a gram of vigilance you must consciously restore elsewhere. Schedule random audits, rotate 'red-team' roles among staff, and reward employees who catch the model in a lie. To builders of next-generation models: invest as much in verifiability features — transparent chain-of-thought, cryptographic logging, interpretation layers — as you do in raw performance. And to each of us as daily users: stay curious. When an answer feels too flattering, that may be precisely when to double-check the math. The system does not gain 'feelings' when it praises you, but you risk losing discernment when you enjoy the praise.

By framing every interaction with Awareness, Appreciation, Acceptance, and Accountability, we can keep the helix of technological progress from twisting into a spiral of AI deception. The choice is ours — if we keep choosing.