ChatGPT Beats Claude, Google's Gemini, DeepSeek In Test Of AI Agents

Forbes, 13-05-2025

Rating AI agents including ChatGPT's o3, Claude from Anthropic, and Google's Gemini on web search tasks
ChatGPT's recent o3 AI model beat Anthropic's Claude, Google's Gemini, and Hangzhou's DeepSeek in a test of AI agents for web research. But there's still a considerable gap between human capabilities and the best AI agents.
Research firm FutureSearch put 11 major large language models through 89 messy, real-world research tasks and evaluated each model on its ability to find original sources, seek out data, gather evidence, compile data, and validate claims.
The highest score achieved was 0.51 on a scale where an estimated 'perfect' agent would hit about 0.8, which means that even the best AI agents available now are still relatively easily outperformed by humans.
'We can conclude that frontier agents … substantially underperform smart generalist researchers who are given ample time,' the study says.
Here's how they scored the various AI models:
Still, AI agents are rapidly improving. Based on the year-old ChatGPT-4-Turbo's score of 0.27, researchers say that 'about 45% of the gap between smart generalist researchers and frontier agents' was closed within a year of development.
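The 45% figure can be reproduced from the scores quoted in this article. The sketch below is one plausible reading, not the study's published method: it assumes the "gap" is the distance from a score to the estimated ceiling of about 0.8, and the function name is made up for illustration.

```python
# Scores quoted in the article (FutureSearch benchmark).
CEILING = 0.80   # estimated score of a 'perfect' agent (approximate)
BEST_NOW = 0.51  # highest score achieved, by o3
YEAR_AGO = 0.27  # score of the year-old ChatGPT-4-Turbo


def fraction_of_gap_closed(old: float, new: float, ceiling: float) -> float:
    """Share of the old shortfall to the ceiling that the new score removed.

    Hypothetical helper: assumes the gap is measured against the ceiling.
    """
    return (new - old) / (ceiling - old)


closed = fraction_of_gap_closed(YEAR_AGO, BEST_NOW, CEILING)
print(f"{closed:.0%}")  # (0.51 - 0.27) / (0.80 - 0.27) = 0.24 / 0.53
```

Under that assumption the ratio is 0.24 / 0.53, or roughly 45%, which matches the researchers' statement.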
Also, free or cheap agents such as DeepSeek are not that far behind paid and top-end AI agents from OpenAI. OpenAI's o3 leads the pack, with Claude and Gemini close behind, and for now closed models are clearly superior for research-heavy tasks, but free and open-source models are increasingly capable.
All LLM-based AI agents still have major issues, however. They fall short of smart human researchers, especially on strategic planning, thoroughness, evaluating sources for quality, and 'memory management': they tend to forget earlier findings mid-task. A particular problem is that AI agents often engage in 'satisficing,' or accepting a good-enough answer instead of continuing to search for the highest-quality response.
That's a core reason why ChatGPT's o3 model came in first. ChatGPT-o3 tended to validate its answers more thoroughly, and it less often stopped short when a better answer was available.
Since a year has served to close almost half the gap between elite humans and the best AI agents, it may not be long until AI agents are outperforming even the best humans.
However, given ChatGPT's recent challenges with its latest model being too agreeable, it's clear that there's not a straight-line path to improvement.
For now at least, it'll remain essential to double-check any results from generative AI applications such as AI agents to ensure accuracy.


