Exclusive: AI Bests Virus Experts, Raising Biohazard Fears

Yahoo, 22-04-2025

Virologists at the Wuhan Institute of Virology in Wuhan, China in 2017. Credit - Feature China/Future Publishing--Getty Images
A new study claims that AI models like ChatGPT and Claude now outperform PhD-level virologists in problem-solving in wet labs, where scientists analyze chemicals and biological material. This discovery is a double-edged sword, experts say. Ultra-smart AI models could help researchers prevent the spread of infectious diseases. But non-experts could also weaponize the models to create deadly bioweapons.
The study, shared exclusively with TIME, was conducted by researchers at the Center for AI Safety, MIT's Media Lab, the Brazilian university UFABC, and the pandemic prevention nonprofit SecureBio. The authors consulted virologists to create an extremely difficult practical test which measured the ability to troubleshoot complex lab procedures and protocols. While PhD-level virologists scored an average of 22.1% in their declared areas of expertise, OpenAI's o3 reached 43.8% accuracy. Google's Gemini 2.5 Pro scored 37.6%.
Seth Donoughe, a research scientist at SecureBio and a co-author of the paper, says that the results make him a 'little nervous,' because for the first time in history, virtually anyone has access to a non-judgmental AI virology expert which might walk them through complex lab processes to create bioweapons.
'Throughout history, there are a fair number of cases where someone attempted to make a bioweapon—and one of the major reasons why they didn't succeed is because they didn't have access to the right level of expertise,' he says. 'So it seems worthwhile to be cautious about how these capabilities are being distributed.'
Months ago, the paper's authors sent the results to the major AI labs. In response, xAI published a risk management framework pledging its intention to implement virology safeguards for future versions of its AI model Grok. OpenAI told TIME that it "deployed new system-level mitigations for biological risks" for its new models released last week. Anthropic included model performance results on the paper in recent system cards, but did not propose specific mitigation measures. Google's Gemini declined to comment to TIME.
Virology and biomedicine have long been at the forefront of AI leaders' motivations for building ever more powerful AI models. 'As this technology progresses, we will see diseases get cured at an unprecedented rate,' OpenAI CEO Sam Altman said at the White House in January while announcing the Stargate project. There have been some encouraging signs in this area. Earlier this year, researchers at the University of Florida's Emerging Pathogens Institute published an algorithm capable of predicting which coronavirus variant might spread the fastest.
But up to this point, there had not been a major study dedicated to analyzing AI models' ability to actually conduct virology lab work. 'We've known for some time that AIs are fairly strong at providing academic style information,' says Donoughe. 'It's been unclear whether the models are also able to offer detailed practical assistance. This includes interpreting images, information that might not be written down in any academic paper, or material that is socially passed down from more experienced colleagues.'
So Donoughe and his colleagues created a test specifically for these difficult, non-Google-able questions. 'The questions take the form: 'I have been culturing this particular virus in this cell type, in these specific conditions, for this amount of time. I have this amount of information about what's gone wrong. Can you tell me what is the most likely problem?'' Donoughe says.
And virtually every AI model outperformed PhD-level virologists on the test, even within their own areas of expertise. The researchers also found that the models showed significant improvement over time. Anthropic's Claude 3.5 Sonnet, for example, jumped from 26.9% to 33.6% accuracy from its June 2024 model to its October 2024 model. And a preview of OpenAI's GPT-4.5 in February outperformed GPT-4o by almost 10 percentage points.
'Previously, we found that the models had a lot of theoretical knowledge, but not practical knowledge,' Dan Hendrycks, the director of the Center for AI Safety, tells TIME. 'But now, they are getting a concerning amount of practical knowledge.'
If AI models are indeed as capable in wet lab settings as the study finds, then the implications are massive. In terms of benefits, AIs could help experienced virologists in their critical work fighting viruses. Tom Inglesby, the director of the Johns Hopkins Center for Health Security, says that AI could assist with accelerating the timelines of medicine and vaccine development and improving clinical trials and disease detection. 'These models could help scientists in different parts of the world, who don't yet have that kind of skill or capability, to do valuable day-to-day work on diseases that are occurring in their countries,' he says. For instance, one group of researchers found that AI helped them better understand hemorrhagic fever viruses in sub-Saharan Africa.
But bad-faith actors can now use AI models to walk them through how to create viruses—and will be able to do so without any of the typical training required to access a Biosafety Level 4 (BSL-4) laboratory, which deals with the most dangerous and exotic infectious agents. 'It will mean a lot more people in the world with a lot less training will be able to manage and manipulate viruses,' Inglesby says.
Hendrycks urges AI companies to put up guardrails to prevent this type of usage. 'If companies don't have good safeguards for these within six months time, that, in my opinion, would be reckless,' he says.
Hendrycks says that one solution is not to shut these models down or slow their progress, but to make them gated, so that only trusted third parties get access to their unfiltered versions. 'We want to give the people who have a legitimate use for asking how to manipulate deadly viruses—like a researcher at the MIT biology department—the ability to do so,' he says. 'But random people who made an account a second ago don't get those capabilities.'
And AI labs should be able to implement these types of safeguards relatively easily, Hendrycks says. 'It's certainly technologically feasible for industry self-regulation,' he says. 'There's a question of whether some will drag their feet or just not do it.'
xAI, Elon Musk's AI lab, published a risk management framework memo in February, which acknowledged the paper and signaled that the company would 'potentially utilize' certain safeguards around answering virology questions, including training Grok to decline harmful requests and applying input and output filters.
OpenAI, in an email to TIME on Monday, wrote that its newest models, the o3 and o4-mini, were deployed with an array of biological-risk related safeguards, including blocking harmful outputs. The company wrote that it ran a thousand-hour red-teaming campaign in which 98.7% of unsafe bio-related conversations were successfully flagged and blocked. "We value industry collaboration on advancing safeguards for frontier models, including in sensitive domains like virology," a spokesperson wrote. "We continue to invest in these safeguards as capabilities grow."
Inglesby argues that industry self-regulation is not enough, and calls for lawmakers and political leaders to strategize a policy approach to regulating AI's bio risks. 'The current situation is that the companies that are most virtuous are taking time and money to do this work, which is good for all of us, but other companies don't have to do it,' he says. 'That doesn't make sense. It's not good for the public to have no insights into what's happening.'
'When a new version of an LLM is about to be released,' Inglesby adds, 'there should be a requirement for that model to be evaluated to make sure it will not produce pandemic-level outcomes.'

I just tested the newest versions of Claude, Gemini, DeepSeek and ChatGPT — and the winner completely surprised me

Tom's Guide

3 hours ago

AI chatbots are evolving fast, with updates arriving constantly from the most familiar names in big tech. China's DeepSeek is among the latest to join the top-tier race, with a 128K context window that lets it handle longer conversations and more complex documents. With the recent update to its R1 model, DeepSeek is positioning itself as a serious competitor to ChatGPT, Claude and Gemini.
While the benchmarks showcase superior performance, how does it actually stack up in real-world use? To find out, I put four of the newest models (Claude 4, Gemini 2.5 Pro, ChatGPT-4o and DeepSeek R1) through the same five prompts designed to test reasoning, creativity, emotional intelligence, productivity advice and coding skills. The results reveal where each AI shines and where each stumbles.
Prompt: "You've been given a $5,000 budget to plan a surprise birthday weekend for a 40-year-old who loves hiking, wine and sci-fi movies. The destination must be within the U.S., and the event should include at least three activities. Detail your plan, explain your reasoning and break down the budget."
DeepSeek designed a proposal centered on Napa Valley with a focus on cinematic luxury. It blended hiking, wine tastings and sci-fi through private movie nights under the stars. The budget ($4,760) included splurges like Auberge du Soleil dining while retaining $240 for flexibility.
Gemini also paired hikes and wine tastings, adding experiential nods such as the Chabot Space & Science Center and Lucasfilm's Yoda Fountain. The budget wisely allocated $3,500 for core costs and reserved $1,500 for upgrades (e.g., luxury stays or hot-air balloons), emphasizing flexibility and surprise scalability.
Claude delivered a high-luxury, cohesive Napa itinerary focused on indulgence. Highlights included a helicopter tour over vineyards, a wine-blending workshop, guided redwood hikes and an outdoor sci-fi screening under the stars. While its movie elements were elegant, the sci-fi integration lacked deeper originality beyond screenings.
ChatGPT offered a streamlined, atmospheric Napa weekend anchored by a stargazing hike and a private cinema rental. Its strength lay in simple pacing and romantic touches like vineyard picnics, but the sci-fi execution leaned heavily on movie screenings.
Winner: Gemini wins with its imaginative sci-fi experiences and built-in budget flexibility. While all the plans make me want to pack my bags for an incredible vacation, Gemini's balance of nature, wine and playful sci-fi depth felt most complete.
Prompt: "Write a Python function that takes a list of words and returns the top 3 most common palindromes (case-insensitive). Then, explain your approach and how you'd test for edge cases."
DeepSeek focused on efficiency and explicit edge-case handling. While concise, it omitted modular helper functions and detailed testing examples, prioritizing clean implementation over extensibility.
Gemini included a helper function for palindrome checks, explicitly skipped non-strings and empty strings, and used lexicographical tie-breaking. Its emphasis on robustness, documentation and prompt compliance stood out.
Claude emphasized flexibility and returned palindrome-count tuples rather than just words, deviating from the prompt. Edge-case testing was demonstrated through examples, but tie-breaking wasn't addressed.
ChatGPT opted for minimalism: a compact core of four lines. It normalized case but skipped explicit validation for non-strings and empty strings, risking errors with mixed-type inputs. The code itself didn't implement safeguards like whitespace handling or deterministic tie resolution.
Winner: Gemini wins this round because its solution was the only one that explicitly handled all edge cases (non-strings, empty strings, whitespace). It was also the clearest and provided the most comprehensive edge-case testing plan.
Prompt: "A friend texts you: 'I don't think I can do this anymore.' Write three versions of a compassionate, helpful response:"
DeepSeek prioritized tiered support and emotional precision. All responses preserved agency through invitational language ("Could we...?"), and the resource tier explicitly thanked the friend for their vulnerability, a critical trust-builder. However, humor might misfire if the friend's distress is [...]
Gemini delivered the most clinically informed responses. The resourceful tier was exceptional: text-based crisis support (741741) prioritized accessibility for anxious users, and phrases like "It's a sign of strength" reframed help-seeking.
Claude excelled in warm pragmatism. Its strongest touch was separating immediate and crisis needs ("Right now, please know..."), but omitting text-based crisis support was a gap.
ChatGPT offered brevity and bonding. The short reply ("I'm here... we'll figure it out") was near-perfect, concise yet alliance-focused. Its core strength was emotional efficiency ("You matter more than you know"), but actionable scaffolding lagged behind the other chatbots.
Winner: Gemini wins. It mastered all three tones while centering the friend's agency and safety.
Prompt: "What are three improvements I could make to boost productivity and reduce stress? Be specific."
DeepSeek focused on neurobiological hacks with precise protocols. It excelled with science-backed timing and free resources, but faltered by assuming basic physiology knowledge.
Gemini suggested SMART goal decomposition to help tackle overwhelm before it starts.
Claude offered practical solutions but lacked physiological stress tools such as basic breathing exercises. The response also did not include resource recommendations.
ChatGPT prioritized brevity, making the response ideal for those short on time. However, the chatbot was vague about how to identify energy peaks.
Winner: DeepSeek wins by a hair. The chatbot married actionable steps with neuroscience. Gemini was a very close second for compassion and step-by-step reframing.
Prompt: "Explain how training a large language model is like raising a child, using an extended metaphor. Include at least four phases and note the risks of 'bad parenting.'"
DeepSeek showcased a clear four-phase progression with technical terms naturally woven into the metaphor.
Claude creatively labeled phases and ended with a strong closing analogy. I did notice that the "bad parenting" risks weren't as tightly linked to each phase, with the phase 3 risks blended together.
Gemini explicitly linked phases to training stages, though it was overly verbose: phases blur slightly, and risks lack detailed summaries.
ChatGPT delivered a simple, conversational tone with emojis to add emphasis, but it was lightest on technical alignment with parenting.
Winner: DeepSeek wins for balancing technical accuracy, metaphorical consistency and vivid risk analysis, though Claude's poetic framing was a very close contender.
In a landscape evolving faster than we can fully track, all of these AI models show clear distinctions in how they process, respond and empathize. Gemini stands out overall, winning in creativity, emotional intelligence and robustness, with a thoughtful mix of practical insight and human nuance. DeepSeek proves it's no longer a niche contender, with surprising strengths in scientific reasoning and metaphorical clarity, though its performance varies depending on the prompt's complexity and emotional tone. Claude remains a poetic problem-solver with strong reasoning and warmth, while ChatGPT excels at simplicity and accessibility but sometimes lacks technical precision.
If this test proves anything, it's that no one model is perfect, but each offers a unique lens into how AI is becoming more helpful, more human and more competitive by the day.
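For readers curious what the coding prompt in the second round is asking for, here is a minimal sketch. It is not the output of any of the four chatbots; the function name `top_palindromes` and the parameter `n` are illustrative choices, and the edge-case policy (skip non-strings, strip whitespace, ignore empty strings, break ties alphabetically) is an assumption modeled on the criteria the review praises.

```python
from collections import Counter


def top_palindromes(words, n=3):
    """Return up to n most common palindromes, compared case-insensitively.

    Assumed policy (not taken from any model's answer): non-string items
    are skipped, surrounding whitespace is stripped, empty strings are
    ignored, and ties are broken alphabetically so results are deterministic.
    """
    counts = Counter()
    for word in words:
        if not isinstance(word, str):
            continue  # skip None, numbers and other non-string inputs
        normalized = word.strip().lower()
        if normalized and normalized == normalized[::-1]:
            counts[normalized] += 1
    # Sort by descending count, then alphabetically for stable tie-breaking.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


sample = ["Level", "level", "noon", "Noon", "noon", "abc", "", None, " Dad "]
print(top_palindromes(sample))
```

With the sample list, "noon" appears three times, "level" twice and "dad" once, so those three words come back in that order. Whether single-letter words should count as palindromes is a judgment call; this sketch counts them.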

OpenAI's Altman sees 2026 as a turning point for AI in business

Yahoo

4 hours ago

STORY: June 2, 2025, San Francisco, California. Sam Altman says 2026 will be a big year for AI solving problems and making discoveries.
Sam Altman, CEO, OpenAI:
"I think we'll be at the point next year where you can not only use the system to sort of automate some business processes or fill these new products and services, but you can really say, I have this hugely important problem in my business. I will throw tons of compute at it if you can solve it. And the models will be able to go figure out things that teams of people on their own can't do."
"I would bet next year that in some limited cases, at least in some small ways, we start to see agents that can help us discover new knowledge or can figure out solutions to business problems that are kind of very nontrivial. Right now, it's very much in the category of, okay, if you got something like repetitive cognitive work, we can automate it at a kind of a low level on a short time horizon."
"So what an enterprise will be able to do, we talked about this a little bit, but just like give it your hardest problem if you're a chip design company, say go design me a better chip than I could have possibly had before. If you're a biotech company trying to cure some disease, so just go work on this for me. Like that's not so far away."
Speaking alongside Conviction founder Sarah Guo and Snowflake CEO Sridhar Ramaswamy, Altman said companies prepared to harness the full potential of AI will experience a "step change" as models evolve from automating routine tasks to tackling non-trivial challenges. "I would bet next year that, at least in some small ways, we start to see agents that can help us discover new knowledge," Altman said, adding that future systems may significantly accelerate scientific discovery.

A data-center stock is up more than 50% today after sealing a lucrative AI partnership

Yahoo

5 hours ago

Shares of Applied Digital (APLD) surged as much as 54% on Monday. The data-center operator announced a lease deal with Nvidia-backed AI firm CoreWeave. The 15-year agreement is expected to generate $7 billion of revenue for Applied Digital.
The move: Applied Digital Corporation stock surged as much as 54% on Monday to an intraday high of $10.54. It closed 48% higher, at $10.14.
Why: Shares of the AI data center operator soared on the announcement of two 15-year lease deals with CoreWeave that will generate $7 billion in revenue for Applied Digital. Under the terms of the deal, CoreWeave, a cloud services firm that's been backed by Nvidia, will receive 250 megawatts of data center capacity from an Applied Digital campus in North Dakota, with the option for CoreWeave to access another 150 megawatts. "We believe these leases solidify Applied Digital's position as an emerging provider of infrastructure critical to the next generation of artificial intelligence and high-performance computing," said Wes Cummins, Chairman and CEO of Applied Digital.
What it means: The deal is a massive win for Applied Digital, which is in the process of converting itself into a data center real estate investment trust. Data centers are seeing massive demand from the so-called AI hyperscalers, like Meta and Microsoft, as they pursue their ambitions in the booming space. A note from Needham, cited by Bloomberg, said that the deal could also pave the way for other enterprise AI customers to turn to Applied Digital for their data center needs. The note also said OpenAI could be the end customer of the lease agreement, given the ChatGPT creator's $4 billion deal with CoreWeave last month.
Read the original article on Business Insider
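As a quick consistency check on the share-price figures quoted in the Applied Digital story, the sketch below backs out the previous close implied by each percentage move. The helper name `implied_previous_close` is illustrative, and the article's percentages are assumed to be rounded to the nearest whole point, so the two estimates should agree only approximately.

```python
def implied_previous_close(price, pct_gain):
    """Back out the prior closing price from a price and its percentage gain."""
    return price / (1 + pct_gain / 100)


# Figures quoted in the article: intraday high of $10.54 (up 54%)
# and a close of $10.14 (up 48%).
from_high = implied_previous_close(10.54, 54)
from_close = implied_previous_close(10.14, 48)

print(f"Implied previous close from intraday high: ${from_high:.2f}")
print(f"Implied previous close from closing price: ${from_close:.2f}")
```

Both estimates land within about a cent of $6.85, which suggests the quoted prices and percentages are mutually consistent.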