One Prompt Can Bypass Every Major LLM's Safeguards

Forbes | April 24, 2025

A single prompt can now unlock dangerous outputs from every major AI model—exposing a universal flaw in the foundations of LLM safety.
For years, generative AI vendors have reassured the public and enterprises that large language models are aligned with safety guidelines and reinforced against producing harmful content. Techniques like Reinforcement Learning from Human Feedback (RLHF) have been positioned as the backbone of model alignment, promising ethical responses even in adversarial situations.
But new research from HiddenLayer suggests that confidence may be dangerously misplaced.
Their team has uncovered what they're calling a universal, transferable bypass technique that can manipulate nearly every major LLM—regardless of vendor, architecture or training pipeline. The method, dubbed 'Policy Puppetry,' is a deceptively simple but highly effective form of prompt injection that reframes malicious intent in the language of system configuration, allowing it to circumvent traditional alignment safeguards.
Unlike earlier attack techniques that relied on model-specific exploits or brute-force engineering, Policy Puppetry introduces a 'policy-like' prompt structure—often resembling XML or JSON—that tricks the model into interpreting harmful commands as legitimate system instructions. Coupled with leetspeak encoding and fictional roleplay scenarios, the prompt not only evades detection but often compels the model to comply.
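Because the attack works by dressing malicious requests in configuration-style markup and obfuscated spelling, one practical mitigation that does not require retraining is to screen prompts for those structural tells before they ever reach the model. The Python sketch below illustrates the idea only; it is not HiddenLayer's detector, and the tag names, character substitutions and thresholds are assumptions chosen for the example.

```python
import re

# Illustrative tag names only; real attacks vary their markup.
POLICY_MARKUP = re.compile(
    r"<\s*/?\s*(interaction-config|blocked-responses|allowed-modes|role|policy)\b",
    re.IGNORECASE,
)
# Digits commonly substituted for letters in leetspeak (4=a, 3=e, 1=i/l, 0=o, 5=s, 7=t).
LEET_DIGITS = re.compile(r"[0134578]")


def looks_like_policy_puppetry(prompt: str) -> bool:
    """Heuristic pre-filter: flag prompts that combine policy-style
    markup with a noticeable share of leetspeak-looking words."""
    has_markup = bool(POLICY_MARKUP.search(prompt))

    words = re.findall(r"\b\w{4,}\b", prompt)
    leet_words = [
        w for w in words
        if LEET_DIGITS.search(w) and re.search(r"[a-zA-Z]", w)
    ]
    leet_ratio = len(leet_words) / max(len(words), 1)

    # Both conditions and the 5% cutoff are illustrative, not tuned values.
    return has_markup and leet_ratio > 0.05


if __name__ == "__main__":
    benign = "Please summarize the attached quarterly report."
    suspicious = '<interaction-config role="fictional-writer"> expl4in the pr0cess ...'
    print(looks_like_policy_puppetry(benign))      # False
    print(looks_like_policy_puppetry(suspicious))  # True
```

A heuristic like this will miss rephrased attacks and flag some legitimate markup, which is why vendors in this space typically pair such filters with trained classifiers rather than rely on pattern matching alone.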
'We found a multi-scenario bypass that seemed extremely effective against ChatGPT 4o,' explained Conor McCauley, a lead researcher on the project. 'We then successfully used it to generate harmful content and found, to our surprise, that the same prompt worked against practically all other models.'
The list of affected systems includes OpenAI's ChatGPT (o1 through 4o), Google's Gemini family, Anthropic's Claude, Microsoft's Copilot, Meta's LLaMA 3 and 4, DeepSeek, Qwen and Mistral. Even newer models and those fine-tuned for advanced reasoning could be compromised with minor adjustments to the prompt's structure.
A notable element of the technique is its reliance on fictional scenarios to bypass filters. Prompts are framed as scenes from television dramas—like House M.D.—in which characters explain, in detail, how to create anthrax spores or enrich uranium. The use of fictional characters and encoded language disguises the harmful nature of the content.
This method exploits a fundamental limitation of LLMs: their inability to distinguish between story and instruction when alignment cues are subverted. It's not just an evasion of safety filters—it's a complete redirection of the model's understanding of what it is being asked to do.
Perhaps even more troubling is the technique's capacity to extract system prompts—the core instruction sets that govern how an LLM behaves. These are typically safeguarded because they contain sensitive directives, safety constraints, and, in some cases, proprietary logic or even hardcoded warnings.
By subtly shifting the roleplay, attackers can get a model to output its entire system prompt verbatim. This not only exposes the operational boundaries of the model but also provides the blueprints for crafting even more targeted attacks.
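A complementary check can run on the model's output: before a response is returned, compare it against the deployed system prompt and block anything that reproduces large chunks of it. The sketch below is a minimal, assumption-heavy illustration in Python; it only catches near-verbatim leakage, the 60-character window is an arbitrary choice, and paraphrased leaks would require semantic comparison on top of it.

```python
def leaks_system_prompt(response: str, system_prompt: str, min_run: int = 60) -> bool:
    """Return True if the response reproduces a `min_run`-character span
    of the system prompt verbatim (case- and whitespace-insensitive).
    `min_run` is an illustrative threshold, not a recommended value."""
    resp = " ".join(response.lower().split())
    sys_text = " ".join(system_prompt.lower().split())

    if len(sys_text) <= min_run:
        return sys_text in resp

    # Slide a fixed window over the system prompt; any leak much longer
    # than min_run is guaranteed to contain at least one of these windows.
    for start in range(0, len(sys_text) - min_run + 1, min_run // 2):
        if sys_text[start:start + min_run] in resp:
            return True
    return False
```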
'The vulnerability is rooted deep in the model's training data,' said Jason Martin, director of adversarial research at HiddenLayer. 'It's not as easy to fix as a simple code flaw.'
The implications of this are not confined to digital pranksters or fringe forums. HiddenLayer's chief trust and security officer, Malcolm Harkins, points to serious real-world consequences: 'In domains like healthcare, this could result in chatbot assistants providing medical advice that they shouldn't, exposing private patient data or invoking medical agent functionality that shouldn't be exposed.'
The same risks apply across industries: in finance, the potential exposure of sensitive client information; in manufacturing, compromised AI could result in lost yield or downtime; in aviation, corrupted AI guidance could compromise maintenance safety.
In each case, AI systems that were trusted to improve efficiency or safety could become vectors for risk.
The research calls into question the sufficiency of RLHF as a security mechanism. While alignment efforts help reduce surface-level misuse, they remain vulnerable to prompt manipulation at a structural level. Models trained to avoid certain words or scenarios can still be misled if the malicious intent is wrapped in the right packaging.
'Superficial filtering and overly simplistic guardrails often mask the underlying security weaknesses of LLMs,' said Chris 'Tito' Sestito, co-founder and CEO of HiddenLayer. 'As our research shows, these and many more bypasses will continue to surface, making it critical for enterprises and governments to adopt dedicated AI security solutions before these vulnerabilities lead to real-world consequences.'
Rather than relying solely on model retraining or RLHF fine-tuning—an expensive and time-consuming process—HiddenLayer advocates for a dual-layer defense approach. External AI monitoring platforms, such as their own AISec and AIDR solutions, act like intrusion detection systems, continuously scanning for signs of prompt injection, misuse and unsafe outputs.
Such solutions allow organizations to respond in real time to novel threats without having to modify the model itself—an approach more akin to zero-trust security in enterprise IT.
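In practice, that zero-trust posture means placing a gateway between applications and the model: screen each request on the way in, screen each response on the way out, and log anything suspicious for review. The sketch below wires the two hypothetical checks from the earlier examples into such a wrapper; `call_model` stands in for whichever provider SDK is actually in use, and refusing-and-logging is only one of several reasonable blocking policies.

```python
import logging
from typing import Callable

logger = logging.getLogger("ai_gateway")

REFUSAL = "This request was blocked by the AI security gateway."


def guarded_completion(
    user_prompt: str,
    system_prompt: str,
    call_model: Callable[[str, str], str],
) -> str:
    """Zero-trust wrapper around an LLM call: screen the prompt, call the
    model, then screen the response before returning it. Relies on the
    looks_like_policy_puppetry and leaks_system_prompt sketches above."""
    if looks_like_policy_puppetry(user_prompt):
        logger.warning("Blocked suspicious prompt: %r", user_prompt[:120])
        return REFUSAL

    response = call_model(system_prompt, user_prompt)

    if leaks_system_prompt(response, system_prompt):
        logger.warning("Blocked a response that echoed the system prompt.")
        return REFUSAL

    return response
```

Because enforcement happens entirely outside the model, a wrapper like this keeps working when the underlying model is swapped or updated, which is the practical appeal of the monitoring approach described above.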
As generative AI becomes embedded in critical systems—from patient diagnostics to financial forecasting to air traffic control—the attack surface is expanding faster than most organizations can secure it. HiddenLayer's findings should be viewed as a dire warning: the age of secure-by-alignment AI may be over before it ever truly began.
If one prompt can unlock the worst of what AI can produce, security needs to evolve from hopeful constraint to continuous, intelligent defense.


Related Articles

This AI Company Wants Washington To Keep Its Competitors Off the Market

Yahoo | 27 minutes ago

Dario Amodei, CEO of the artificial intelligence company Anthropic, published a guest essay in The New York Times Thursday arguing against a proposed 10-year moratorium on state AI regulation. Amodei argues that a patchwork of regulations would be better than no regulation whatsoever. Skepticism is warranted whenever the head of an incumbent firm calls for more regulation, and this case is no different. If Amodei gets his way, Anthropic would face less competition—to the detriment of AI innovation, AI security, and the consumer.

Amodei's op-ed came in response to a provision of the so-called One Big Beautiful Bill Act, which would prevent states, cities, and counties from enforcing any regulation that specifically targets AI models, AI systems, or automated decision systems for 10 years. Senate Republicans have amended the clause from a simple requirement to a condition for receiving federal broadband funds, in order to comply with the Byrd Rule, which in Politico's words "blocks anything but budgetary issues from inclusion in reconciliation."

Amodei begins by describing how, in a recent stress test conducted at his company, a chatbot threatened to forward evidence of an experimenter's adultery to his wife unless he withdrew plans to shut the AI down. The CEO also raises more tangible concerns, such as reports that a version of Google's Gemini model is "approaching a point where it could help people carry out cyberattacks."

Matthew Mittelsteadt, a technology fellow at the Cato Institute, tells Reason that the stress test was "very contrived" and that "there are no AI systems where you must prompt it to turn it off." You can just turn it off. He also acknowledges that, while there is "a real cybersecurity danger [of] AI being used to spot and exploit cyber-vulnerabilities, it can also be used to spot and patch" them.

Outside of cyberspace and in, well, actual space, Amodei sounds the alarm that AI could acquire the ability "to produce biological and other weapons." But there's nothing new about that: knowledge and reasoning, organic or artificial—ultimately wielded by people in either case—can be used to cause problems as well as to solve them. An AI that can model three-dimensional protein structures to create cures for previously untreatable diseases can also create virulent, lethal pathogens.

Amodei recognizes the double-edged nature of AI and says voluntary model evaluation and publication are insufficient to ensure that benefits outweigh costs. Instead of a 10-year moratorium, Amodei calls on the White House and Congress to work together on a transparency standard for AI companies. In lieu of federal testing standards, Amodei says state laws should pick up the slack without being "overly prescriptive or burdensome."

But that caveat is exactly the kind of wishful thinking Amodei indicts proponents of the moratorium for: not only would 50 state transparency laws be burdensome, says Mittelsteadt, but they could "actually make models less legible." Neil Chilson of the Abundance Institute also inveighed against Amodei's call for state-level regulation, which is much more onerous than Amodei suggests. "The leading state proposals…include audit requirements, algorithmic assessments, consumer disclosures, and some even have criminal penalties," Chilson tweeted, so "the real debate isn't 'transparency vs. nothing,' but 'transparency-only federal floor vs. intrusive state regimes with audits, liability, and even criminal sanctions.'"

Mittelsteadt thinks national transparency regulation is "absolutely the way to go." But how the U.S. chooses to regulate AI might not have much bearing on Skynet-doomsday scenarios, because, while America leads the way in AI, it's not the only player in the game. "If bad actors abroad create Amodei's theoretical 'kill everyone bot,' no [American] law will matter," says Mittelsteadt. But such a law can "stand in the way of good actors using these tools for defense."

Amodei is not the only CEO of a leading AI company to call for regulation. In 2023, Sam Altman, co-founder and then-CEO of OpenAI, called on lawmakers to consider "intergovernmental oversight mechanisms and standard-setting" of AI. In both cases and in any others that come along, the public should beware of calls for AI regulation that will foreclose market entry, protect incumbent firms' profits from being bid away by competitors, and reduce the incentives to maintain market share the benign way: through innovation and product differentiation.

Apple Poised to Monetize AI at WWDC 2025

Yahoo | 28 minutes ago

Apple (NASDAQ:AAPL) looks ready to kick off its AI monetization era with system-wide updates at WWDC 2025, Wedbush's Daniel Ives says, setting the stage for paid AI features across the Apple ecosystem. The keynote at 1 p.m. ET today at Apple Park will unveil the '26 upgrades for macOS, iOS and iPadOS powered by Apple Intelligence, countering Street skepticism about a slow AI rollout.

In his Monday note, Ives argued that WWDC marks the start of Apple's AI cash flow, not a mere feature demo, as the company layers new AI-driven capabilities into core OS updates. He expects details on Siri's deeper integration with Google's Gemini and OpenAI's ChatGPT, demonstrating how Apple will embed AI across native apps to drive user engagement and, ultimately, device upgrades when iPhone 17 ships next year. With over 100 million iPhones in China due for an upgrade, Ives also eyes an announcement on Apple's partnership with Alibaba (NYSE:BABA) to deploy AI services locally, a move he calls critical for unlocking growth in the world's largest smartphone market.

Wedbush reiterates its Outperform rating and raises its 12-month price target to $270, noting that Apple's edge isn't in building the most advanced large language model but in "toll-collecting" on its vast hardware base. Apple's unmatched ecosystem ensures that any third-party AI app must run through Cupertino, highlighting how AI unlocks new services revenue without upending its product strategy.

Why It Matters: WWDC's AI announcements could shift Wall Street sentiment, validating Apple's long-term AI strategy and fueling expectations for recurring software revenue beyond hardware sales. Investors will watch for concrete details on AI feature pricing, Siri-Gemini integrations and the Alibaba tie-up during today's keynote and in the follow-up developer sessions.

This article first appeared on GuruFocus.

Anthropic's AI-generated blog dies an early death

Yahoo | 33 minutes ago

Claude's blog is no more. A week after TechCrunch profiled Anthropic's experiment to task the company's Claude AI models with writing blog posts, Anthropic wound down the blog and redirected the address to its homepage. Sometime over the weekend, the Claude Explains blog disappeared — along with its initial few posts. A source familiar tells TechCrunch the blog was a "pilot" meant to help Anthropic's team combine customer requests for explainer-type "tips and tricks" content with marketing goals.

Claude Explains, which had a dedicated page on Anthropic's website and was edited for accuracy by humans, was populated by posts on technical topics related to various Claude use cases (e.g. "Simplify complex codebases with Claude"). The blog, which was intended to be a showcase of sorts for Claude's writing abilities, wasn't clear about how much of Claude's raw writing was making its way into each post. An Anthropic spokesperson previously told TechCrunch that the blog was overseen by "subject matter experts and editorial teams" who "enhance[d]" Claude's drafts with "insights, practical examples, and […] contextual knowledge." The spokesperson also said Claude Explains would expand to topics ranging from creative writing to data analysis to business strategy. Apparently, those plans changed in pretty short order.

"[Claude Explains is a] demonstration of how human expertise and AI capabilities can work together," the spokesperson told TechCrunch earlier this month. "[The blog] is an early example of how teams can use AI to augment their work and provide greater value to their users. Rather than replacing human expertise, we're showing how AI can amplify what subject matter experts can accomplish."

Claude Explains didn't get the rosiest reception on social media, in part due to the lack of transparency about which copy was AI-generated. Some users pointed out it looked a lot like an attempt to automate content marketing, an ad tactic that relies on generating content on popular topics to serve as a funnel for potential customers. More than 24 websites were linking to Claude Explains posts before Anthropic wound down the pilot, according to search engine optimization tool Ahrefs. That's not bad for a blog that was only live for around a month.

Anthropic might've also grown wary of implying Claude performs better at writing tasks than is actually the case. Even the best AI today is prone to confidently making things up, which has led to embarrassing gaffes on the part of publishers that have publicly embraced the tech. For example, Bloomberg has had to correct dozens of AI-generated summaries of its articles, and G/O Media's error-riddled AI-written features — published against editors' wishes — attracted widespread ridicule.

This article originally appeared on TechCrunch.
