Popular AIs head-to-head: OpenAI beats DeepSeek on sentence-level reasoning
ChatGPT and other AI chatbots based on large language models are known to occasionally make things up, including scientific and legal citations. It turns out that measuring how accurate an AI model's citations are is a good way of assessing the model's reasoning abilities.
An AI model 'reasons' by breaking down a query into steps and working through them in order. Think of how you learned to solve math word problems in school.
Ideally, to generate citations an AI model would understand the key concepts in a document, generate a ranked list of relevant papers to cite, and provide convincing reasoning for how each suggested paper supports the corresponding text. It would highlight specific connections between the text and the cited research, clarifying why each source matters.
The question is, can today's models be trusted to make these connections and provide clear reasoning that justifies their source choices? The answer goes beyond citation accuracy to address how useful and accurate large language models are for any information retrieval purpose.
I'm a computer scientist. My colleagues − researchers from the AI Institute at the University of South Carolina, Ohio State University and University of Maryland Baltimore County − and I have developed the Reasons benchmark to test how well large language models can automatically generate research citations and provide understandable reasoning.
We used the benchmark to compare the performance of two popular AI reasoning models, DeepSeek's R1 and OpenAI's o1. Though DeepSeek made headlines with its stunning efficiency and cost-effectiveness, the Chinese upstart has a way to go to match OpenAI's reasoning performance.
The accuracy of citations has a lot to do with whether the AI model is reasoning about information at the sentence level rather than paragraph or document level. Paragraph-level and document-level citations can be thought of as throwing a large chunk of information into a large language model and asking it to provide many citations.
In this process, the large language model overgeneralizes and misinterprets individual sentences. The user ends up with citations that explain the whole paragraph or document, not the relatively fine-grained information in the sentence.
Further, reasoning suffers when you ask the large language model to read through an entire document. These models mostly rely on memorizing patterns that they typically are better at finding at the beginning and end of longer texts than in the middle. This makes it difficult for them to fully understand all the important information throughout a long document.
Large language models get confused because paragraphs and documents hold a lot of information, which affects citation generation and the reasoning process. Consequently, reasoning from large language models over paragraphs and documents becomes more like summarizing or paraphrasing.
The Reasons benchmark addresses this weakness by examining large language models' citation generation and reasoning at the sentence level.
Following the release of DeepSeek R1 in January 2025, we wanted to examine its accuracy in generating citations and its quality of reasoning and compare it with OpenAI's o1 model. We created a paragraph that had sentences from different sources, gave the models individual sentences from this paragraph, and asked for citations and reasoning.
To start our test, we developed a small test bed of about 4,100 research articles around four key topics that are related to human brains and computer science: neurons and cognition, human-computer interaction, databases and artificial intelligence. We evaluated the models using two measures: F-1 score, which measures how accurate the provided citation is, and hallucination rate, which measures how sound the model's reasoning is − that is, how often it produces an inaccurate or misleading response.
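To make the two measures concrete, here is a minimal sketch in Python of how an F-1 score and a hallucination rate could be computed. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the function names, the use of sets of citation identifiers, and the list of true/false judgments are all hypothetical.

```python
def f1_score(predicted, relevant):
    """F-1 over two collections of citation identifiers (hypothetical format)."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    # Precision: share of suggested citations that are correct.
    precision = true_positives / len(predicted)
    # Recall: share of correct citations that were suggested.
    recall = true_positives / len(relevant)
    # F-1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(judgments):
    """Fraction of responses judged inaccurate or misleading.

    `judgments` is an assumed list of booleans, True meaning the
    response was flagged as a hallucination.
    """
    if not judgments:
        return 0.0
    return sum(1 for flagged in judgments if flagged) / len(judgments)
```

For example, a model that suggests two papers, only one of which is among the two correct ones, has a precision of 0.5, a recall of 0.5 and therefore an F-1 of 0.5.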
Our testing revealed significant performance differences between OpenAI o1 and DeepSeek R1 across different scientific domains. OpenAI's o1 did well connecting information between different subjects, such as understanding how research on neurons and cognition connects to human-computer interaction and then to concepts in artificial intelligence, while remaining accurate. Its performance metrics consistently outpaced DeepSeek R1's across all evaluation categories, especially in reducing hallucinations and successfully completing assigned tasks.
OpenAI o1 was better at combining ideas semantically, whereas R1 focused on making sure it generated a response for every attribution task, which in turn increased hallucination during reasoning. OpenAI o1 had a hallucination rate of approximately 35% compared with DeepSeek R1's rate of nearly 85% in the attribution-based reasoning task.
In terms of accuracy and linguistic competence, OpenAI o1 scored about 0.65 on the F-1 test, which means, roughly, that it was right about 65% of the time when answering questions. It also scored about 0.70 on the BLEU test, which measures how well a language model writes in natural language. These are pretty good scores.
DeepSeek R1 scored lower, with about 0.35 on the F-1 test, meaning it was right about 35% of the time. Its BLEU score was also low, only about 0.2, which means its writing wasn't as natural-sounding as OpenAI o1's. This shows that o1 was better both at finding the right information and at presenting it in clear, natural language.
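For readers curious about what a BLEU score counts, the following Python sketch shows a simplified version: it compares the word sequences in a model's text with those in a reference text and penalizes answers that are too short. This is an assumption-laden illustration, not the scoring used in our study. The function name `simple_bleu`, the whitespace tokenization, the single reference and the limit of two-word sequences are all choices made for brevity; standard BLEU typically uses sequences of up to four words.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: one reference, no smoothing (hypothetical helper)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(cand_counts.values())
        # Clip each n-gram's count by how often it appears in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: candidates shorter than the reference are scored down.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1.0, and one that shares no words with it scores 0.0; scores in between reflect partial overlap.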
On other benchmarks, DeepSeek R1 performs on par with OpenAI o1 on math, coding and scientific reasoning tasks. But the substantial difference on our benchmark suggests that o1 provides more reliable information, while R1 struggles with factual consistency.
Though we included other models in our comprehensive testing, the performance gap between o1 and R1 specifically highlights the current competitive landscape in AI development, with OpenAI's offering maintaining a significant advantage in reasoning and knowledge integration capabilities.
These results suggest that OpenAI still has a leg up when it comes to source attribution and reasoning, possibly due to the nature and volume of the data it was trained on. The company recently announced its deep research tool, which can create reports with citations, ask follow-up questions and provide reasoning for the generated response.
The jury is still out on the tool's value for researchers, but the caveat remains for everyone: Double-check all citations an AI gives you.
This article is republished from The Conversation, a nonprofit, independent news organization bringing you facts and trustworthy analysis to help you make sense of our complex world. It was written by: Manas Gaur, University of Maryland, Baltimore County
Manas Gaur receives funding from USISTEF Endowment Fund.
