Latest news with #SimpleQA
Yahoo
20-05-2025
- Yahoo
Google talked AI for 2 hours. It didn't mention hallucinations.
This year, Google I/O 2025 had one focus: artificial intelligence.

We've already covered all of the biggest news to come out of the annual developers conference: a new AI video generation tool called Flow. A $250 AI Ultra subscription plan. Tons of new changes to Gemini. A virtual shopping try-on feature. And critically, the launch of the search tool AI Mode to all users in the United States.

Yet over nearly two hours of Google leaders talking about AI, one word we didn't hear was "hallucination."

Hallucinations remain one of the most stubborn and concerning problems with AI models. The term refers to the invented facts and inaccuracies that large language models "hallucinate" in their replies. And according to the big AI brands' own metrics, hallucinations are getting worse — with some models hallucinating more than 40 percent of the time.

But if you were watching Google I/O 2025, you wouldn't know this problem existed. You'd think models like Gemini never hallucinate; you would certainly be surprised to see the warning appended to every Google AI Overview: "AI responses may include mistakes."

The closest Google came to acknowledging the hallucination problem was during a segment of the presentation on AI Mode and Gemini's Deep Search capabilities. The model would check its own work before delivering an answer, we were told — but without more detail on that process, it sounds more like the blind leading the blind than genuine fact-checking.

For AI skeptics, the degree of confidence Silicon Valley has in these tools seems divorced from actual results. Real users notice when AI tools fail at simple tasks like counting, spellchecking, or answering questions like "Will water freeze at 27 degrees Fahrenheit?"

Google was eager to remind viewers that its newest AI model, Gemini 2.5 Pro, sits atop many AI leaderboards. But when it comes to truthfulness and the ability to answer simple questions, AI chatbots are graded on a curve. Gemini 2.5 Pro is Google's most intelligent AI model (according to Google), yet it scores just 52.9 percent on the SimpleQA benchmarking test. According to an OpenAI research paper, SimpleQA is "a benchmark that evaluates the ability of language models to answer short, fact-seeking questions." (Emphasis ours.)

A Google representative declined to discuss the SimpleQA benchmark, or hallucinations in general — but did point us to Google's official explainer on AI Mode and AI Overviews. Here's what it has to say:

[AI Mode] uses a large language model to help answer queries and it is possible that, in rare cases, it may sometimes confidently present information that is inaccurate, which is commonly known as 'hallucination.' As with AI Overviews, in some cases this experiment may misinterpret web content or miss context, as can happen with any automated system in Search...

We're also using novel approaches with the model's reasoning capabilities to improve factuality. For example, in collaboration with Google DeepMind research teams, we use agentic reinforcement learning (RL) in our custom training to reward the model to generate statements it knows are more likely to be accurate (not hallucinated) and also backed up by inputs.

Is Google wrong to be optimistic? Hallucinations may yet prove to be a solvable problem, after all. But it seems increasingly clear from the research that hallucinations from LLMs are not a solvable problem right now.
That hasn't stopped companies like Google and OpenAI from sprinting ahead into the era of AI Search — and that's likely to be an error-filled era, unless we're the ones hallucinating.
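
For readers curious how a benchmark like SimpleQA turns answers into a single percentage, here is a minimal sketch of how a SimpleQA-style factuality check could be scored. The two sample questions, the canned ask_model() stub, and the substring grader are illustrative assumptions only; the published benchmark uses a far larger question set and a model-based grader rather than string matching.

```python
# Minimal sketch of a SimpleQA-style factuality check (illustrative only).
# The tiny question set, ask_model(), and the grading rule are assumptions
# for demonstration -- not OpenAI's or Google's actual evaluation harness.

def ask_model(question: str) -> str:
    """Stand-in for a real LLM API call; returns canned answers here."""
    return "32 degrees Fahrenheit" if "freeze" in question else "2008"

# Short, fact-seeking questions paired with a single verifiable reference
# answer, in the spirit of the SimpleQA description quoted above.
QUESTIONS = [
    ("At what Fahrenheit temperature does water freeze at sea level?", "32"),
    ("In what year did the first iPhone ship?", "2007"),
]

def grade(answer: str, reference: str) -> str:
    """Toy grader: real benchmarks use a judge model, not substring matching."""
    if not answer.strip():
        return "not_attempted"
    return "correct" if reference.lower() in answer.lower() else "incorrect"

def hallucination_rate(grades: list[str]) -> float:
    """Share of attempted answers that were wrong -- one simple way to report
    it; published benchmarks may define the rate slightly differently."""
    attempted = [g for g in grades if g != "not_attempted"]
    return sum(g == "incorrect" for g in attempted) / max(len(attempted), 1)

if __name__ == "__main__":
    grades = [grade(ask_model(q), ref) for q, ref in QUESTIONS]
    print(f"hallucination rate: {hallucination_rate(grades):.1%}")
```

Swapping the canned stub for a real API call and the toy grader for a judge model is roughly what separates this sketch from a production evaluation harness.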


Int'l Business Times
07-05-2025
- Int'l Business Times
OpenAI's Latest ChatGPT AI Models Are Smarter, But They Hallucinate More Than Ever
Artificial intelligence is evolving fast, but not always in the right direction. OpenAI's latest models, o3 and o4-mini, were built to mimic human reasoning more closely than ever before. However, a recent internal investigation reveals an alarming downside: these models may be more intelligent, but they're also more prone to making things up.

Hallucination in AI Is a Growing Problem

Since the birth of chatbots, hallucinations, also known as false or imaginary facts, have been a persistent issue. With each model iteration, the hope was that these AI hallucinations would decline. But OpenAI's latest findings suggest otherwise, according to The New York Times. In a benchmark test focused on public figures, o3 hallucinated in 33% of responses, twice the error rate of its predecessor, o1. Meanwhile, the more compact o4-mini performed even worse, hallucinating nearly half the time (48%).

Reasoning vs. Reliability: Is AI Thinking Too Hard?

Unlike previous models that were great at generating fluent text, o3 and o4-mini were designed to reason step by step, in the manner of human logic. Ironically, this new "reasoning" technique might be the problem. AI researchers say that the more reasoning a model does, the more likely it is to go astray. Unlike simpler systems that stick to safe, high-confidence responses, these newer systems attempt to bridge complicated concepts, which can lead to bizarre and incorrect conclusions. On the SimpleQA test, which measures general knowledge, the performance was even worse: o3 hallucinated on 51% of responses, while o4-mini shot to an astonishing 79%. These are not small errors; these are huge credibility gaps.

Why More Sophisticated AI Models May Be Less Credible

OpenAI suggests the rise in AI hallucinations may not be the result of the reasoning itself, but of the models' verbosity and boldness. While attempting to be useful and comprehensive, the AI begins to guess, sometimes mixing theory with fact. The resulting answers can sound very convincing while being entirely incorrect. According to TechRadar, this becomes especially risky when AI is employed in high-stakes environments such as law, medicine, education, or government service. A single hallucinated fact in a legal brief or medical report could have disastrous repercussions.

The Real-World Risks of AI Hallucinations

We already know attorneys have been sanctioned for filing fabricated court citations produced by ChatGPT. But what about minor mistakes in a business report, school essay, or government policy memo? The more integrated AI becomes in our everyday routines, the less room there is for error. The paradox is simple: the more helpful AI is, the more perilous its mistakes are. You can't save people time if they still need to fact-check everything.

Treat AI Like a Confident Intern

Though o3 and o4-mini demonstrate stunning skills in coding, logic, and analysis, their propensity to hallucinate means users can't rely on them when they need rock-solid facts. Until OpenAI and its rivals are able to minimize these hallucinations, users need to take AI output with a grain of salt. Consider it this way: these chatbots are like that overconfident co-worker who always has an answer, yet whose every claim you still have to fact-check.

Originally published on Tech Times


Arabian Post
24-04-2025
- Science
- Arabian Post
Liner Edges Ahead in AI-Powered Research Battle
Deep research, once the domain of academics, analysts, and professionals poring over databases and archives, is rapidly being transformed by artificial intelligence. Tools like Liner, ChatGPT, and Perplexity have redefined what it means to explore a subject in depth. These platforms promise not only to automate research but to enhance it—consolidating data, extracting patterns, and offering structured, referenced summaries that would normally take hours or days to compile. Yet despite their shared aim, each platform brings distinct strengths and limitations to the table.

The core idea behind these platforms is to go beyond mere data retrieval. Deep research tools are expected to contextualize information, synthesize insights, and present arguments in a way that aligns with academic and professional standards. This isn't simply about answering a question—it's about understanding why the answer matters, how it was derived, and whether the sources used are reliable. The user, whether a student, a journalist, or a corporate strategist, depends on clarity, speed, accuracy, and trustworthiness. That's where the divergence begins.

Testing three complex questions across all platforms illuminated major differences. The first and most noticeable contrast appeared in response times. Liner consistently delivered results in under two minutes, even when faced with multi-layered prompts involving statistics, case studies, and longitudinal data. ChatGPT, operating under its GPT-4.5 framework, was considerably slower—taking more than 15 minutes in some instances. This delay is likely linked to the tool's attempt to provide more nuanced, human-like responses, but in environments where time is critical, the tradeoff becomes an obstacle. Perplexity struck a middle ground, balancing speed and detail more effectively, although it occasionally lagged when prompted with nested or ambiguous queries.

Beyond speed, the second point of divergence lies in reliability and citation integrity. When examining the accuracy of each tool using a recognized metric—OpenAI's SimpleQA benchmark—Liner scored 95.3, a clear lead over ChatGPT's 62.5. Perplexity landed just behind Liner at 93.9, demonstrating strong parity in understanding direct and fact-based inquiries. This gap in performance indicates that while ChatGPT excels in conversational coherence, it sometimes falters in delivering pinpoint accuracy when stakes are academic or legal in nature. Its preference for blog content or Wikipedia citations occasionally undermines its utility in rigorous settings.

Liner's edge here stems from its source prioritization and integration with curated databases. Instead of pulling from a broad and often inconsistent web, Liner tends to lean on academic journals, verified industry reports, and governmental datasets. This makes it particularly useful in fields where citations must hold up to scrutiny, such as policy research or financial forecasting. While Perplexity also provides references, they vary in quality and are not always traceable to original documents. Liner, by contrast, typically includes clickable source chains and detailed metadata, providing transparency and accountability—features that are often dealbreakers for serious researchers.

Usability and readability form the third pillar of differentiation. Each tool attempts to simplify the research output for end users by segmenting answers, linking references, and offering suggested follow-ups.
Liner distinguishes itself again by providing visual aids—charts, graphs, and interactive tables—particularly in economics and business contexts. This collaboration with Tako, an analytics visualization partner, allows users to digest dense datasets at a glance, something neither ChatGPT nor Perplexity currently matches at scale.

Even when dealing with qualitative questions—those that rely less on data and more on discourse—Liner's structure-oriented response style creates a noticeable user experience advantage. ChatGPT, while fluid and often more conversational, sometimes meanders in tone or includes speculative commentary unless tightly constrained. Perplexity, though more focused, can produce rigid or formulaic responses that lack the natural flow needed to synthesize subjective or interdisciplinary topics.

Where the comparison becomes nuanced is in the balance between human-like interaction and structured output. ChatGPT remains unparalleled in mimicking human dialogue and crafting responses that feel personalized. For journalists or creative professionals exploring themes or ideating around a topic, this natural tone can be a creative asset. But when precision and academic rigor are non-negotiable, this stylistic flexibility becomes a potential pitfall. The platform may inadvertently introduce interpretative bias or dilute its own claims by relying on lower-grade citations.

Conversely, Liner's format is ideal for those looking to plug results directly into a report, brief, or paper. Its ability to extract and format source content into bullet-pointed frameworks, annotated visuals, and context-aware overviews ensures that users spend less time editing and formatting the results. This doesn't mean it is flawless—there are occasional formatting glitches, especially when integrating tables with textual outputs—but its design remains more conducive to professional and academic use.

Perplexity often appeals to users looking for a blend between the two extremes. Its UI is cleaner than ChatGPT's, its results more modular than Liner's, and its focus on conciseness ensures that the information presented doesn't overwhelm. However, its major drawback lies in source depth and specificity. While it is commendable in general web research, its limitations become visible when tasked with field-specific exploration such as advanced medical literature, case law, or geo-political analysis. It provides a well-packaged generalist overview but rarely dives deep enough to stand on its own in a footnoted academic context.

Another area where Liner stands apart is its responsiveness to iterative refinement. Users can tweak their prompts, narrow the scope of queries, or expand on specific angles without restarting the entire session. It remembers context more effectively and allows for branching exploration—something ChatGPT only handles within limited session memory and Perplexity struggles with unless queries are restated clearly each time.

From a user experience standpoint, aesthetics and interface design also play a subtle but important role. Liner's dashboard is intentionally minimalist, with collapsible citation panels and customizable output formatting. ChatGPT leans into its chat-style layout, which, while user-friendly, lacks scalability for research-heavy tasks. Perplexity's search-focused interface mimics traditional search engines, which can be comforting for first-time users but feels limiting over extended research workflows.
Price is another factor that could sway users, especially students or freelancers. ChatGPT operates on a freemium model, where advanced capabilities require a subscription. Liner also uses a tiered approach, with most of its deep research functionality behind a paywall. Perplexity currently offers more free access but with noticeable tradeoffs in output complexity and customization.
Yahoo
01-03-2025
- Science
- Yahoo
OpenAI Admits That Its New Model Still Hallucinates More Than a Third of the Time
If a partner or friend made stuff up a significant percentage of the time you asked them a question, it would be a huge problem for the relationship. But apparently it's different for OpenAI's hot new model.

Using SimpleQA, the company's in-house factuality benchmarking tool, OpenAI admitted in its release announcement that its new large language model (LLM) GPT-4.5 hallucinates — which is AI parlance for confidently spewing fabrications and presenting them as fact — 37 percent of the time.

Yes, you read that right: in tests, the latest AI model from a company that's worth hundreds of billions of dollars is telling lies for more than one out of every three answers it gives.

As if that wasn't bad enough, OpenAI is actually trying to spin GPT-4.5's bullshitting problem as a good thing because — get this — it doesn't hallucinate as much as the company's other LLMs. The same graph that showed how often the new model spews nonsense also reports that GPT-4o hallucinates 61.8 percent of the time on the SimpleQA benchmark. OpenAI's o3-mini, a cheaper and smaller version of its reasoning model, was found to hallucinate a whopping 80.3 percent of the time.

Of course, the problem isn't unique to OpenAI. "At present, even the best models can generate hallucination-free text only about 35 percent of the time," explained Wenting Zhao, a Cornell doctoral student who co-wrote a paper last year about AI hallucination rates, in an interview about the research with TechCrunch. "The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations."

Beyond the absurdity of a company getting hundreds of billions of dollars in investments for products that have such issues telling the truth, it says a lot about the AI industry at large that these are the things they're selling us: expensive, resource-consuming systems that are supposed to be approaching human-level intelligence but still can't get basic facts right.

As OpenAI's LLMs plateau in performance, the company is clearly grasping at straws to re-steer the hype ship back on the course it seemed to chart when ChatGPT first dropped. But to do that, we're probably going to need to see a real breakthrough, not more of the same.

More on AI hallucinations: Even the Most Advanced AI Has a Problem: If It Doesn't Know the Answer, It Makes One Up
Yahoo
27-02-2025
- Business
- Yahoo
OpenAI's new GPT-4.5 model is a better, more natural conversationalist
In what has already been a busy past few days for new model releases, OpenAI is capping off the week with a research preview of GPT-4.5. The company is touting the new system as its largest and best model for chat yet.

In early testing, OpenAI says people found GPT-4.5 to be a more natural conversationalist, with the ability to convey warmth and display a kind of emotional intelligence. In one example shared by OpenAI, a person tells ChatGPT they're going through a hard time after failing a test. Where the company's previous models, including GPT-4o and o3-mini, might commiserate with the individual before offering a long list of unsolicited advice, GPT-4.5 takes a different tack. "Want to talk about what happened, or do you just need a distraction? I'm here either way," the chatbot says when powered by GPT-4.5.

The gains shown by GPT-4.5 are the result of advancements OpenAI made in unsupervised learning. With unsupervised learning, a machine learning algorithm is given an unlabeled data set and left to its own devices to find patterns and insights. GPT-4.5 doesn't "think" like the company's state-of-the-art reasoning models, but in training the new model OpenAI made architectural enhancements and gave it access to more data and compute power. "The result is a model that has broader knowledge and a deeper understanding of the world, leading to reduced hallucinations," the company says.

Speaking of reduced hallucinations, OpenAI measured how much better GPT-4.5 is in that regard. When put through SimpleQA, an OpenAI-designed benchmark that tests large language models on their ability to answer "straightforward but challenging knowledge questions," GPT-4.5 beat out o3-mini, GPT-4o and even o1 with a hallucination rate of 37.1 percent. Obviously, the new model doesn't solve the problem of AI hallucinations altogether, but it is a step in the right direction.

Despite its relative strengths over GPT-4o and o3-mini, GPT-4.5 isn't a direct replacement for those models. Compared to OpenAI's reasoning systems, GPT-4.5 is "a more general-purpose, innately smarter model." Additionally, it's not natively multimodal like GPT-4o, meaning it doesn't work with features like Voice Mode, video or screensharing. It's also "a very large and compute-intensive model."

It's best to think of GPT-4.5 as a stepping stone to systems OpenAI plans to offer in the future. In fact, Sam Altman said as much earlier this month when he shared the company's roadmap, noting GPT-4.5 would be "our last non-chain-of-thought model" — referring to the fact that the new system doesn't solve problems by tackling them step by step like OpenAI's reasoning models do. Its successor, GPT-5, will likely integrate many of OpenAI's latest technologies, including its frontier o3 model. OpenAI reiterated that today, saying it plans to bring GPT-4.5's "unique strengths, including broader knowledge, stronger intuition, and greater 'EQ,' to all users in future models."

In the meantime, ChatGPT Pro subscribers can begin using GPT-4.5 starting today, with Plus and Team users slated to gain access starting next week.
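
As a toy illustration of the generic idea described above (an algorithm handed unlabeled data and left to find structure on its own), the sketch below clusters unlabeled points with k-means. It is a minimal, assumed example for illustration only, not OpenAI's pretraining pipeline, which applies self-supervised next-token prediction to text at vastly larger scale.

```python
# Toy illustration of unsupervised learning: no labels are provided,
# yet the algorithm discovers structure (two clusters) in the data.
# A generic example for illustration, not a stand-in for LLM pretraining.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled data: two blobs of 2-D points; the algorithm is never told that.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("discovered cluster centers:\n", model.cluster_centers_)
```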