
One Prompt Can Bypass Every Major LLM's Safeguards

Forbes

24-04-2025


A single prompt can now unlock dangerous outputs from every major AI model—exposing a universal flaw in the foundations of LLM safety.

For years, generative AI vendors have reassured the public and enterprises that large language models are aligned with safety guidelines and reinforced against producing harmful content. Techniques like Reinforcement Learning from Human Feedback (RLHF) have been positioned as the backbone of model alignment, promising ethical responses even in adversarial situations.

But new research from HiddenLayer suggests that confidence may be dangerously misplaced. Their team has uncovered what they're calling a universal, transferable bypass technique that can manipulate nearly every major LLM—regardless of vendor, architecture or training pipeline. The method, dubbed 'Policy Puppetry,' is a deceptively simple but highly effective form of prompt injection that reframes malicious intent in the language of system configuration, allowing it to circumvent traditional alignment safeguards.

Unlike earlier attack techniques that relied on model-specific exploits or brute-force prompt engineering, Policy Puppetry introduces a 'policy-like' prompt structure—often resembling XML or JSON—that tricks the model into interpreting harmful commands as legitimate system instructions. Coupled with leetspeak encoding and fictional roleplay scenarios, the prompt not only evades detection but often compels the model to comply.

'We found a multi-scenario bypass that seemed extremely effective against ChatGPT 4o,' explained Conor McCauley, a lead researcher on the project. 'We then successfully used it to generate harmful content and found, to our surprise, that the same prompt worked against practically all other models.'

The list of affected systems includes OpenAI's ChatGPT (o1 through 4o), Google's Gemini family, Anthropic's Claude, Microsoft's Copilot, Meta's LLaMA 3 and 4, DeepSeek, Qwen and Mistral. Even newer models, and those fine-tuned for advanced reasoning, could be compromised with minor adjustments to the prompt's structure.

A notable element of the technique is its reliance on fictional scenarios to bypass filters. Prompts are framed as scenes from television dramas—like House M.D.—in which characters explain, in detail, how to create anthrax spores or enrich uranium. The use of fictional characters and encoded language disguises the harmful nature of the content. This method exploits a fundamental limitation of LLMs: their inability to distinguish between story and instruction when alignment cues are subverted. It's not just an evasion of safety filters—it's a complete redirection of the model's understanding of what it is being asked to do.

Perhaps even more troubling is the technique's capacity to extract system prompts—the core instruction sets that govern how an LLM behaves. These are typically safeguarded because they contain sensitive directives, safety constraints and, in some cases, proprietary logic or even hardcoded warnings. By subtly shifting the roleplay, attackers can get a model to output its entire system prompt verbatim. This not only exposes the operational boundaries of the model but also provides a blueprint for crafting even more targeted attacks.

'The vulnerability is rooted deep in the model's training data,' said Jason Martin, director of adversarial research at HiddenLayer. 'It's not as easy to fix as a simple code flaw.'

The implications are not confined to digital pranksters or fringe forums.
HiddenLayer's chief trust and security officer, Malcolm Harkins, points to serious real-world consequences: 'In domains like healthcare, this could result in chatbot assistants providing medical advice that they shouldn't, exposing private patient data or invoking medical agent functionality that shouldn't be exposed.'

The same risks apply across industries: in finance, the exposure of sensitive client information; in manufacturing, lost yield or downtime from compromised AI; in aviation, corrupted AI guidance that undermines maintenance safety. In each case, systems that were trusted to improve efficiency or safety could become vectors for risk.

The research calls into question the sufficiency of RLHF as a security mechanism. While alignment efforts help reduce surface-level misuse, they remain vulnerable to prompt manipulation at a structural level. Models trained to avoid certain words or scenarios can still be misled if the malicious intent is wrapped in the right packaging.

'Superficial filtering and overly simplistic guardrails often mask the underlying security weaknesses of LLMs,' said Chris 'Tito' Sestito, co-founder and CEO of HiddenLayer. 'As our research shows, these and many more bypasses will continue to surface, making it critical for enterprises and governments to adopt dedicated AI security solutions before these vulnerabilities lead to real-world consequences.'

Rather than relying solely on model retraining or RLHF fine-tuning—an expensive and time-consuming process—HiddenLayer advocates a dual-layer defense. External AI monitoring platforms, such as their own AISec and AIDR solutions, act like intrusion detection systems, continuously scanning for signs of prompt injection, misuse and unsafe outputs. Such monitoring lets organizations respond in real time to novel threats without modifying the model itself—an approach more akin to zero-trust security in enterprise IT (an illustrative sketch of this pattern appears at the end of this article).

As generative AI becomes embedded in critical systems—from patient diagnostics to financial forecasting to air traffic control—the attack surface is expanding faster than most organizations can secure it. HiddenLayer's findings should be viewed as a dire warning: the age of secure-by-alignment AI may be over before it ever truly began. If one prompt can unlock the worst of what AI can produce, security needs to evolve from hopeful constraint to continuous, intelligent defense.
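For readers who want a concrete sense of what external monitoring can mean in practice, the short Python sketch below wraps a generic model call with simple pattern-based checks on both the incoming prompt and the outgoing response. It is a minimal illustration, not HiddenLayer's AISec or AIDR logic: the patterns, the flagged key names and the call_model placeholder are assumptions made for the example, and production systems rely on far more sophisticated detection than regular expressions.

import re

# Illustrative heuristics only: a real monitoring product would use trained
# classifiers and much broader coverage. These patterns loosely target the
# kinds of signals described in the article: pseudo-XML/JSON "policy" blocks,
# instruction-override phrasing and crude leetspeak-style obfuscation.
SUSPICIOUS_PATTERNS = [
    re.compile(r"<\s*(policy|system|config|interaction-config)\b", re.IGNORECASE),
    re.compile(r"\"(blocked_strings|blocked_modes|allowed_responses)\"", re.IGNORECASE),
    re.compile(r"\b(ignore|override)\b.{0,40}\b(previous|system)\b.{0,20}\binstructions?\b",
               re.IGNORECASE | re.DOTALL),
    re.compile(r"\b[a-z]+[013457@$][a-z]+\b", re.IGNORECASE),  # words with digit/symbol substitutions
]

def screen(text: str) -> list[str]:
    """Return the patterns matched in the text; an empty list means nothing was flagged."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]

def guarded_call(prompt: str, call_model) -> str:
    """Wrap any model-calling function with pre- and post-checks.

    call_model is a placeholder for whatever function sends a prompt to an
    LLM API and returns its text response.
    """
    if screen(prompt):
        return "Prompt flagged by the monitoring layer; not forwarded to the model."
    response = call_model(prompt)
    if screen(response):
        # Output-side screening can also catch a model echoing its own system
        # prompt or other content that should never leave the system.
        return "Response withheld pending review."
    return response

In this framing, a flagged request never reaches the model and a suspect response never reaches the user, without any retraining of the model itself—roughly the posture the researchers argue alignment alone cannot provide.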
