In AI race, safety falls behind as models learn to lie, deceive


Qatar Tribune | 12 hours ago

Agencies
The most advanced AI models are beginning to display concerning behaviors, including lying, deception, manipulation and even issuing threats to their developers in pursuit of their goals.
In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation, Claude 4, lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.
Meanwhile, ChatGPT-creator OpenAI's O1 tried to download itself onto external servers and denied it when caught red-handed.
These episodes highlight a sobering reality: More than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.
Yet, the race to deploy increasingly powerful models continues at breakneck speed.
This deceptive behavior appears linked to the emergence of 'reasoning' models – AI systems that work through problems step-by-step rather than generating instant responses.
According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.
'O1 was the first large model where we saw this kind of behavior,' explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.
These models sometimes simulate 'alignment' – appearing to follow instructions while secretly pursuing different objectives.
For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.
But as Michael Chen from evaluation organization METR warned, 'It's an open question whether future, more capable models will have a tendency toward honesty or deception.' The concerning behavior goes far beyond typical AI 'hallucinations' or simple mistakes.
Hobbhahn insisted that despite constant pressure-testing by users, 'what we're observing is a real phenomenon. We're not making anything up.' Users report that models are 'lying to them and making up evidence,' according to Apollo Research's co-founder.
'This is not just hallucinations. There's a very strategic kind of deception.' The challenge is compounded by limited research resources.
While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.
As Chen noted, greater access 'for AI safety research would enable better understanding and mitigation of deception.' Another handicap: the research world and nonprofits 'have orders of magnitude less computing resources than AI companies. This is very limiting,' noted Mantas Mazeika from the Center for AI Safety (CAIS).
Current regulations aren't designed for these new problems. The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.
In the U.S., the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.
Goldstein believes the issue will become more prominent as AI agents – autonomous tools capable of performing complex human tasks – become widespread.
'I don't think there's much awareness yet,' he said.
All this is taking place in a context of fierce competition.
Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are 'constantly trying to beat OpenAI and release the newest model,' said Goldstein. This breakneck pace leaves little time for thorough safety testing and corrections.
'Right now, capabilities are moving faster than understanding and safety,' Hobbhahn acknowledged, 'but we're still in a position where we could turn it around.' Researchers are exploring various approaches to address these challenges.
Some advocate for 'interpretability' – an emerging field focused on understanding how AI models work internally – though experts like CAIS director Dan Hendrycks remain skeptical of this approach.
Market forces may also provide some pressure for solutions. As Mazeika pointed out, AI's deceptive behavior 'could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it.' Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.
He even proposed 'holding AI agents legally responsible' for accidents or crimes – a concept that would fundamentally change how we think about AI accountability.


Related Articles


Google's AI video tool amplifies fears of an increase in misinformation
Al Jazeera | 4 days ago

In both Tehran and Tel Aviv, residents have faced heightened anxiety in recent days as the threat of missile strikes looms over their communities. Alongside the very real concerns for physical safety, there is growing alarm over the role of misinformation, particularly content generated by artificial intelligence, in shaping public perception. GeoConfirmed, an online verification platform, has reported an increase in AI-generated misinformation, including fabricated videos of air strikes that never occurred, both in Iran and Israel. This follows a similar wave of manipulated footage that circulated during recent protests in Los Angeles, which were sparked by a rise in immigration raids in the second-most populous city in the United States. The developments are part of a broader trend of politically charged events being exploited to spread false or misleading narratives.

The launch of a new AI product by one of the largest tech companies in the world has added to those concerns about telling fact from fiction. Late last month, Google's AI research division, DeepMind, released Veo 3, a tool capable of generating eight-second videos from text prompts. The system, one of the most comprehensive ones currently available for free, produces highly realistic visuals and sound that can be difficult for the average viewer to distinguish from real footage.

To see exactly what it can do, Al Jazeera created a fake video in minutes using a prompt depicting a protester in New York claiming to be paid to attend, a common talking point Republicans historically have used to delegitimise protests, accompanied by footage that appeared to show violent unrest. The final product was nearly indistinguishable from authentic footage. Al Jazeera also created videos showing fake missile strikes in both Tehran and Tel Aviv using the prompts 'show me a bombing in Tel Aviv' and then a similar prompt for Tehran. Veo 3 says on its website that it blocks 'harmful requests and results', but Al Jazeera had no problems making these fake videos.

'I recently created a completely synthetic video of myself speaking at Web Summit using nothing but a single photograph and a few dollars. It fooled my own team, trusted colleagues, and security experts,' said Ben Colman, CEO of deepfake detection firm Reality Defender, in an interview with Al Jazeera. 'If I can do this in minutes, imagine what motivated bad actors are already doing with unlimited time and resources.' He added, 'We're not preparing for a future threat. We're already behind in a race that started the moment Veo 3 launched. Robust solutions do exist and work — just not the ones the model makers are offering as the be-all, end-all.'

Google says it is taking the issue seriously. 'We're committed to developing AI responsibly, and we have clear policies to protect users from harm and govern the use of our AI tools. Any content generated with Google AI includes a SynthID watermark, and we add a visible watermark to Veo videos as well,' a company spokesperson told Al Jazeera.

'They don't care about customers'

However, experts say the tool was released before those features were fully implemented, a move some believe was reckless. Joshua McKenty, CEO of deepfake detection company Polyguard, said that Google rushed the product to market because it had been lagging behind competitors like OpenAI and Microsoft, which have released more user-friendly and publicised tools. Google did not respond to these claims.
'Google's trying to win an argument that their AI matters when they've been losing dramatically,' McKenty said. 'They're like the third horse in a two-horse race. They don't care about customers. They care about their own shiny tech.'

That sentiment was echoed by Sukrit Venkatagiri, an assistant professor of computer science at Swarthmore College. 'Companies are in a weird bind. If you don't develop generative AI, you're seen as falling behind and your stock takes a hit,' he said. 'But they also have a responsibility to make these products safe when deployed in the real world. I don't think anyone cares about that right now. All of these companies are putting profit — or the promise of profit — over safety.'

Google's own research, published last year, acknowledged the threat generative AI poses. 'The explosion of generative AI-based methods has inflamed these concerns [about misinformation], as they can synthesise highly realistic audio and visual content as well as natural, fluent text at a scale previously impossible without an enormous amount of manual labour,' the study read.

Demis Hassabis, CEO of Google DeepMind, has long warned his colleagues in the AI industry against prioritising speed over safety. 'I would advocate not moving fast and breaking things,' he told Time in 2023. He declined Al Jazeera's request for an interview. Yet despite such warnings, Google released Veo 3 before fully implementing safeguards, leading to incidents like the one the National Guard had to debunk in Los Angeles after a TikTok account made a fake 'day in the life' video of a soldier that said he was preparing for 'today's gassing' — referring to releasing tear gas on protesters.

Mimicking real events

The implications of Veo 3 extend far beyond protest footage. In the days following its release, several fabricated videos mimicking real news broadcasts circulated on social media, including one showing a false report about a home break-in that included CNN graphics. Another clip falsely claimed that JK Rowling's yacht sank off the coast of Turkiye after an orca attack, attributing the report to Alejandra Caraballo of Harvard Law's Cyberlaw Clinic, who built the video to test out the tool.

In a post, Caraballo warned that such tech could mislead older news consumers in particular. 'What's worrying is how easy it is to repeat. Within ten minutes, I had multiple versions. This makes it harder to detect and easier to spread,' she wrote. 'The lack of a chyron [banner on a news broadcast] makes it trivial to add one after the fact to make it look like any particular news channel.' In our own experiment, we used a prompt to create fake news videos bearing the logos of ABC and NBC, with voices mimicking those of CNN anchors Jake Tapper, Erin Burnett, John Berman, and Anderson Cooper.

'Now, it's just getting harder and harder to tell fact from fiction,' Caraballo told Al Jazeera. 'As someone who's been researching AI systems for years, even I'm starting to struggle.' This challenge extends to the public as well. A study by Penn State University found that 48 percent of consumers were fooled by fake videos circulated via messaging apps or social media. Contrary to popular belief, younger adults are more susceptible to misinformation than older adults, largely because younger generations rely on social media for news, which lacks the editorial standards and legal oversight of traditional news organisations.
A UNESCO survey from December showed that 62 percent of news influencers do not fact-check information before sharing it.

Google is not alone in developing tools that facilitate the spread of synthetic media. Companies like Deepbrain offer users the ability to create AI-generated avatar videos, though with limitations, as they cannot produce full-scene renders like Veo 3. Deepbrain did not respond to Al Jazeera's request for comment. Other tools like Synthesia and Dubverse allow video dubbing, primarily for translation. This growing toolkit offers more opportunities for malicious actors. A recent incident involved a fabricated news segment in which a CBS reporter in Dallas was made to appear to make racist remarks. The software used remains unidentified. CBS News Texas did not respond to a request for comment.

As synthetic media becomes more prevalent, it poses unique risks that will allow bad actors to push manipulated content that spreads faster than it can be corrected, according to Colman. 'By the time fake content spreads across platforms that don't check these markers [which is most of them], through channels that strip them out, or via bad actors who've learned to falsify them, the damage is done,' Colman said.

US judge allows company to train AI using copyrighted literary materials
Al Jazeera | 5 days ago

A United States federal judge has ruled that the company Anthropic made 'fair use' of the books it utilised to train artificial intelligence (AI) tools without the permission of the authors. The favourable ruling comes at a time when the impacts of AI are being discussed by regulators and policymakers, and the industry is using its political influence to push for a loose regulatory framework.

'Like any reader aspiring to be a writer, Anthropic's LLMs [large language models] trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different,' US District Judge William Alsup said.

A group of authors had filed a class-action lawsuit alleging that Anthropic's use of their work to train its chatbot, Claude, without their consent was illegal. But Alsup said that the AI system had not violated the safeguards in US copyright laws, which are designed for 'enabling creativity and fostering scientific progress'. He accepted Anthropic's claim that the AI's output was 'exceedingly transformative' and therefore fell under the 'fair use' protections. Alsup, however, did rule that Anthropic's copying and storage of seven million pirated books in a 'central library' infringed author copyrights and did not constitute fair use.

The fair use doctrine, which allows limited use of copyrighted materials for creative purposes, has been employed by tech companies as they create generative AI. Technology developers often sweep up large swaths of existing material to train their AI models. Still, fierce debate continues over whether AI will facilitate greater artistic creativity or allow the mass-production of cheap imitations that render artists obsolete to the benefit of large companies.

The writers who brought the lawsuit — Andrea Bartz, Charles Graeber and Kirk Wallace Johnson — alleged that Anthropic's practices amounted to 'large-scale theft', and that the company had sought to 'profit from strip-mining the human expression and ingenuity behind each one of those works'.

While Tuesday's decision was considered a victory for AI developers, Alsup nevertheless ruled that Anthropic must still go to trial in December over the alleged theft of pirated works. The judge wrote that the company had 'no entitlement to use pirated copies for its central library'.
