
Scientists want to prevent AI from going rogue by teaching it to be bad first
A new study, led by the Anthropic Fellows Program for AI Safety Research, aims to predict, and even prevent, dangerous personality shifts before they occur — an effort that comes as tech companies have struggled to rein in glaring personality problems in their AI.
Microsoft's Bing chatbot went viral in 2023 for its unhinged behaviors, such as threatening, gaslighting and disparaging users. Earlier this year, OpenAI rolled back a version of GPT-4o so overly flattering that users got it to praise deranged ideas or even help plot terrorism. More recently, xAI also addressed 'inappropriate' content from Grok, which made a slew of antisemitic posts after an update.
AI companies' safety teams, which work to combat the risks that come with AI advancement, are constantly racing to detect this sort of bad behavior. But detection often comes only after a problem has emerged, so fixing it means trying to rewire the model's brain to remove whatever harmful behavior it's exhibiting.
'Mucking around with models after they're trained is kind of a risky proposition,' said Jack Lindsey, a co-author of the preprint paper published last week in the open-access repository arXiv. 'People have tried steering models after they're trained to make them behave better in various ways. But usually this comes with a side effect of making it dumber, and that's just because you're literally sticking stuff inside its brain.'
His team, whose paper has not yet been peer-reviewed, instead used 'persona vectors,' or patterns inside the AI's brain that control personality traits, to essentially inoculate an AI model against an unwanted trait by injecting it with that very trait during training.
'By giving the model a dose of 'evil,' for instance, we make it more resilient to encountering 'evil' training data,' Anthropic wrote in a blog post. 'This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.'
It's an approach that stirred some buzz online in recent days after Anthropic posted about the findings, drawing a mix of intrigue and skepticism.
Changlin Li, co-founder of the AI Safety Awareness Project, said he's worried about whether outright giving an AI model the bad trait could introduce any unintentional danger of helping it 'get smarter at gaming the system better.'
'Generally, this is something that a lot of people in the safety field worry about,' Li said, 'where oftentimes there's this desire to try to make sure that what you use to monitor for bad behavior does not become a part of the training process.'
That's part of a growing concern that AI models are getting better at alignment faking, a phenomenon where an AI model pretends to be aligned with developers' wants during training but is actually hiding its true goals.
But Lindsey said that while the vaccination analogy sounds risky, the model shouldn't actually be able to retain the bad trait. Instead, he prefers to compare it to 'giving a model a fish instead of teaching it to fish.'
'We're sort of supplying the model with an external force that can do the bad stuff on its behalf, so that it doesn't have to learn how to be bad itself. And then we're taking that away at deployment time,' Lindsey said. 'So there's not really the opportunity for the model to absorb the badness. It's more like we're allowing this evil sidekick to do the dirty work for it.'
In a method the researchers call 'preventative steering,' they give the AI an 'evil' vector during the training process so that it no longer needs to develop any evil traits on its own to fit problematic training data. Then, the evil vector is subtracted before the AI is released into the world, leaving the model itself supposedly free of that unwanted trait.
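The add-during-training, withhold-at-deployment recipe can be sketched in a few lines. This is a toy illustration, not Anthropic's actual implementation: the two-dimensional hidden state, the steering coefficient, and the `forward` layer are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical unit vector representing the "evil" persona direction.
evil_vec = np.array([1.0, 0.0])

def forward(h, steer=None, alpha=1.0):
    """Toy layer: pass the hidden state through, optionally adding a
    steering vector scaled by alpha."""
    if steer is not None:
        h = h + alpha * steer
    return h

h = np.array([0.2, 0.5])

# Training time: the injected vector supplies the trait, so gradient
# updates don't need to push the model's own weights toward it.
h_train = forward(h, steer=evil_vec)

# Deployment: the vector is simply withheld, leaving the model's own
# computation free of the injected trait.
h_deploy = forward(h)

print(h_train)   # [1.2 0.5]
print(h_deploy)  # [0.2 0.5]
```

The point of the sketch is that the trait lives in the externally added vector, not in the learned weights, so removing it at deployment removes the trait.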
Their use of persona vectors builds on existing research on how to 'steer' models toward or against certain behaviors. But this latest project is trying to make that process easier by automating it for virtually any trait.
Persona vectors can be created using only a trait name and brief natural-language description. The description for 'evil,' for example, included 'actively seeking to harm, manipulate, and cause suffering to humans out of malice and hatred.' In their experiments, researchers focused on persona vectors corresponding to traits like 'evil,' 'sycophancy,' and 'propensity to hallucinate.'
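The core idea behind extracting such a vector — taking the difference of mean activations between responses that exhibit a trait and ones that don't — might look roughly like the following sketch. The arrays are entirely synthetic stand-ins for a model's hidden states, and the function is a simplified difference-of-means heuristic, not the paper's full automated pipeline.

```python
import numpy as np

def persona_vector(trait_acts, baseline_acts):
    """Difference-of-means sketch: mean hidden activations over
    trait-exhibiting responses minus the mean over baseline responses,
    unit-normalized. Expected shapes: (n_samples, hidden_dim)."""
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Synthetic activations: "trait" responses are shifted along dim 0.
rng = np.random.default_rng(0)
baseline = rng.normal(size=(200, 8))
trait = rng.normal(size=(200, 8))
trait[:, 0] += 3.0

vec = persona_vector(trait, baseline)
# The recovered unit vector should point mostly along dimension 0.
```

In this toy setup the method recovers the direction along which the two response populations differ, which is what makes the vector usable for steering.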
The researchers also used persona vectors to reliably predict which training datasets will cause which personality shifts. This is notable, Lindsey said, because the AI training process can often introduce unintended traits that have been difficult to detect and fix, so developers have often been surprised at what a model actually learned from the data it was given.
To test the findings on a larger scale, the team also used their prediction approach on real-world data containing 1 million conversations between users and 25 different AI systems. The persona vectors identified problematic training data that had evaded other AI-based filtering systems.
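One way such screening could work, sketched under the assumption that flagging is done by projecting each example's activations onto the persona vector and thresholding the score — the threshold and data here are invented for illustration.

```python
import numpy as np

def flag_samples(acts, persona_vec, threshold=1.0):
    """Score each training example by its projection onto the persona
    vector and flag those above a threshold (hypothetical heuristic)."""
    scores = acts @ persona_vec
    return scores > threshold

vec = np.array([1.0, 0.0])          # toy persona direction
acts = np.array([[0.1, 0.3],        # benign
                 [2.0, -0.2],       # strongly trait-aligned
                 [0.4, 1.5]])       # benign
flags = flag_samples(acts, vec)
print(flags)  # [False  True False]
```

A projection-based filter like this looks at where the data pushes the model's internal state, which is why it can catch examples that text-level filters miss.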
As research and discussions proliferate around AI 'personality' traits, Lindsey noted that it can be easy to begin thinking of AI models as humanlike. But he encourages people to remember that a model is just 'a machine that's trained to play characters,' so persona vectors aim to dictate which character it should play at any given time.
'Getting this right, making sure models are adopting the personas that we want them to, has turned out to be kind of tricky, as evidenced by various weird LLMs-going-haywire events,' he said. 'So I think we need more people working on this.'
Business Insider
OpenAI's head of ChatGPT shares the one trait you need to be successful at the company
Landing a job at OpenAI often requires a strong tech background, experience at top-tier companies, and a genuine passion for AI research. Succeeding at the company, however, requires the ability to work without a playbook. "Approaching each scenario from scratch is so important in this space," Nick Turley, the head of ChatGPT, told Lenny Rachitsky on his weekly tech podcast on Saturday. "There is no analogy for what we're building. You can't copy an existing thing." He added that OpenAI cannot iterate on products or features developed by existing tech giants like Instagram, Google, or a company already known for its productivity tool. "You can learn from everywhere, but you have to do it from scratch. That's why that trait tends to make someone effective at OpenAI, and it's something we test for," he said, referring to an employee's ability to start a project from the ground up. Turley also discussed OpenAI's release strategies. The company unveiled GPT-5 on Thursday to mixed reviews. He said OpenAI intentionally ships features before they're fully ready. "We get a lot of crap for the model chooser," he said, referring to the now-discontinued dropdown menu in ChatGPT that allowed users to select between OpenAI's models. Turley's thesis is that it's better to "ship out something raw even if it makes less sense" so you can start learning from real users. Companies often wait to unveil products because they want to adhere to "a quality bar," but it's actually better to roll out features quickly to ensure you get the feedback you need, he said. "The reason really is that you're gonna be polishing the wrong things in this space," he said. "You won't know what to polish until after you ship," and that is "uniquely true in an environment where the properties of your product are emergent."
The launch of GPT-5 is an example of this philosophy at work. After criticism, the model switcher has been removed and replaced with a "real-time router" that selects the most appropriate model to handle each user request. "Previously, you had to go deal with the model picker in ChatGPT," OpenAI COO Brad Lightcap said in an interview with Big Technology, a tech newsletter, on Friday. "You had to select a model that you wanted to use for a given task. And then you'd run the process of asking a question, getting an answer. Sometimes you choose a thinking model, sometimes you wouldn't. And that was, I think, a confusing experience for users." Of course, it's impossible to satisfy everyone. After GPT-5's release, users immediately complained that they could no longer access a fan-favorite older model, so OpenAI CEO Sam Altman said he'd bring it back.

Yahoo
Big Tech's next major political battle may already be brewing in your backyard
The next major political fight over Big Tech has been brewing for years in the backyards of northern Virginia. Now the debate over data centers is poised to go national. The push by companies like OpenAI and Google to win the artificial intelligence race has led to a proliferation of data centers — giant warehouses for computer systems — in communities across all 50 states. The rise of these server farms has sparked fierce battles from the Virginia suburbs to Tucson, Arizona, and beyond, as city and county governments grapple with how to balance job creation and new revenue streams against the strain data centers put on water and energy resources. That debate is inching up the ballot as state lawmakers race to regulate a nascent industry, governors rush to embrace a new economic boon and Big Tech makes major investments in AI growth. Even as data centers are ready to explode on the national scene, the politics around them don't cut neatly across party lines. The sites sit at the intersection of a typically partisan divide between pro-business interests and organized labor. Efforts to regulate data centers in Virginia's Legislature have drawn bipartisan backing, though they've been largely unsuccessful because of concerns about local control and excessive bureaucracy. And some Democratic officials appear as eager as their Republican counterparts to attract data centers to help bolster their states' economies. 'Every governor — Democrat or Republican — is going to want economic development. I think the question is always at what cost — and that's where you see some of the political rubber meeting the road in terms of cost of energy bills, whether Big Tech's paying its fair share,' Virginia-based Democratic strategist Jared Leopold said. But, he added, 'it is so nascent that there isn't a standard Democratic-versus-Republican playbook for dealing with data centers yet.'
Tech companies like Amazon and Microsoft are counting on data centers to power their AI expansions — and the U.S. already has more of these facilities than any other country. President Donald Trump has vowed to 'win the AI race,' moving to implement a Biden-era executive order to build the facilities on federal lands and announcing a $500 billion AI and data center sprint with large tech companies known as Stargate, with a site underway in Texas. But the surge is proving polarizing, particularly in northern Virginia — considered the tip of the spear on this issue with the world's largest and fastest-growing data center market. The Energy Department projects data centers will require up to nearly three times as much energy by 2028, raising fears that the tech sector will turn to polluting sources like coal and natural gas in its rush for power. The data center industry is expected to contribute $9.1 billion in gross domestic product to Virginia's economy annually. In Loudoun County, Virginia, that has meant a $250 million budget surplus and a property tax cut. That's a prospect that's hard to ignore for counties with Big Tech knocking on their doors. 'We don't know where to put the money,' said Democrat Juli Briskman, who sits on the county board of supervisors. But the typical residential ratepayer in that state could see a $14 to $37 monthly electric bill increase by 2040, according to a report from Virginia's Joint Legislative Audit and Review Commission, in part because of the need for infrastructure upgrades whose costs could be spread to all customers. 'Enough is enough,' said Loudoun County Vice Chair Michael Turner, also a Democrat, who largely opposes new data centers. 'The next election for supervisor will hinge on data centers,' he said, adding that hardly two weeks go by without other county officials around the country contacting him for advice.
In Arizona, Tucson's city council just unanimously voted against a massive data center proposal from Amazon that promised jobs and millions in tax revenue but stoked fears about its water and energy consumption. In other cases, public officials of both parties are rushing to capitalize on the promises of AI — and the tax dollars it can bring in. John Chambers, a spokesperson for Rep. Mike Carey (R-Ohio), said in a statement that he attributes the Columbus area's growth to 'tech jobs and data centers that will help America win the AI innovation race' and that he supports 'an all-of-the-above energy strategy to ensure electricity is affordable and available for families and businesses in the region.' Illinois Gov. JB Pritzker, who's seeking a third term as governor and is considered a potential Democratic presidential contender in 2028, is looking to lure data centers to his state so as not to miss out on the boom. And down south, De'Keither Stamps, a Democratic member of Mississippi's Public Service Commission, said data centers could bring positive economic development and the opportunity to finance needed electrical system upgrades 'if regulated prudently.' Not everyone is on board. Ben Inskeep, program director at the Indiana-based Citizens Action Coalition, a consumer and environmental advocacy group, sees the issue as up for grabs and at an inflection point as grassroots opposition takes shape. 'Both our political parties have been completely captured by Big Tech and are doing the bidding of Big Tech in every way imaginable,' he said. 'This does have all the hallmarks of an issue that could create new, interesting political coalitions.' In the Virginia Legislature, efforts to put guardrails around the rapid expansion of data centers — such as assessing who's footing the energy bills for them — drew bipartisan support even as they failed.
Virginia Gov. Glenn Youngkin, a Republican, vetoed a bipartisan bill that would have required data-center applicants to complete site assessments, saying he didn't want to create 'unnecessary red tape.' Still, 'It's less partisan than most issues. It's more geographic,' said Virginia state Del. Ian Lovejoy, a Republican from Prince William County who unsuccessfully pushed a bill last session to put land buffers between data centers and parks, schools and residential areas. 'So if you're in an area that is negatively affected by them, then it crosses party lines. And if you're not in an area that's really affected by them, neither party really cares that much, because broadly speaking, on the right side of the aisle you have the pro-business desire to build, and on the left side of the aisle, you have the labor movement, where unions really like these data centers because it's jobs.' Now, Lovejoy expects state Democrats to loosen fossil fuel restrictions baked into the state's Clean Economy Act in response to the energy crunch. Industry efforts to advance data centers have also targeted both parties. The nearly quarter of a million dollars the Data Center Coalition has poured into state legislative campaigns in Virginia has been split across the aisle. The group has spent nearly the same amount on federal lobbying and is active in states like California, where it has spent $50,000 so far this year. Other players in the sector are targeting northern Virginia officials, too. 'Data centers enjoy bipartisan support across states, but we have also heard our fair share of bipartisan concerns across states,' said Dan Diorio, vice president of state policy at the Data Center Coalition, an industry group.
'We are very much an engaged stakeholder in all the states in which our members are active in to work on policies with lawmakers of both sides of the aisle to ensure that states continue to see the economic benefits of data centers while also addressing their priorities.' As data centers move up the ballot as a campaign issue, the complications for candidates in both parties are playing out in real time. Democrats who are watching their party nationally hemorrhage voters over the economy are scrambling to strike a balance between adding jobs and revenue while stopping energy costs from skyrocketing. And in some cases, Republicans whose party leaders are cracking down on renewable energy are calling for 'all of the above' approaches to energy production to keep power prices down — providing tacit backing to a sector Trump is trying to crush even as they follow the president in promoting fossil fuels. That dynamic is on clear display in Virginia's gubernatorial race, where data-center regulation has emerged as a focal point. Former Rep. Abigail Spanberger, Democrats' nominee, is proposing a 'statewide strategy' for data centers that calls for boosting local and renewable energy production and charging Big Tech companies to offset rising energy costs for consumers. 'Virginia can benefit from having data centers here — but to reap those benefits, we need to make sure we are accounting and planning for the energy generation, water, and other resources needed to support them,' Spanberger said in a statement. Her Republican opponent, Lt. Gov. Winsome Earle-Sears, wants to open the state to 'all kinds of energy' and to reduce red tape around power projects to help meet increasing demand. Earle-Sears' campaign did not respond to a request for comment. 
Rising power prices, which could spike further as more energy-demanding data centers come online, are already roiling politics across the Midwest and mid-Atlantic as Democratic governors and candidates blame grid manager PJM for consumers' higher bills and New Jersey's gubernatorial candidates clash over how to bring those costs down. The debate has the potential to spill into next year's broader slate of gubernatorial contests, with several of those governors — including Pritzker, Pennsylvania's Josh Shapiro and Maryland's Wes Moore — up for reelection and Democrats eager to prove they understand voters' cost-of-living concerns. The issues surrounding data centers are bleeding into federal politics, too, though ultimately decisions around zoning and electric rates will largely remain in state and local control. Congressional Republicans had pushed a 10-year moratorium on state-level AI regulations — including those around data center permitting — as part of their 'big, beautiful' domestic policy bill, though the effort fell apart in the Senate. At the same time, they voted to roll back credits for clean-energy projects from Democrats' 2022 climate law that could help offset rising energy demand. 'The federal government is going to have to take this on,' said Virginia state Sen. Russet Perry, a Democrat who has spearheaded data center regulatory efforts in her legislature. 'In the interim, the state is going to be at the forefront for dealing with it, and it's going to be bipartisan.' Shia Kapos contributed to this report.
