
Latest news with #PersonQA

Contradictheory: AI and the next generation

The Star

7 days ago

  • Lifestyle
  • The Star

Contradictheory: AI and the next generation

Here's a conversation I don't think we'd have heard five years ago: 'You know what they do? They send in their part of the work, and it's so obviously ChatGPT. I had to rewrite the whole thing!'

This wasn't a chat I had with the COO of some major company but with a 12-year-old child. She was talking about a piece of group work they had to do for class. And this Boy, as she called him (you could hear the capitalised italics in her voice), had waited until the last minute to submit his part.

To be honest, I shouldn't be surprised. These days, lots of people use AI in their work. It's normal. According to the 2024 Work Trend Index released by Microsoft and LinkedIn, 75% of employees were already using artificial intelligence (AI) to save time and focus on their most important tasks.

But it's not without its problems. An adult using AI to help draft an email is one thing. A student handing in their weekly assignment is another. The adult uses AI to communicate more clearly, but the student is taking a shortcut. So, in an effort to deliver better work, the child might actually be learning less.

And it's not going away. A 2024 study by Impact Research for the Walton Family Foundation found that 48% of students use ChatGPT at least weekly, a jump of 27 percentage points over 2023. And more students use AI chatbots to write essays and assignments (56%) than to study for tests and quizzes (52%).

So what about the other students who don't use AI, like the girl I quoted above? I find they often take a rather antagonistic view. Some kids I talk to (usually the ones already doing well in class) seem to look down on classmates who use AI to do their homework and, in the process, look down on AI itself. And I think that's wrong.

As soon as I learned about ChatGPT, I felt that the key to using AI tools well is obvious. It lies in the name: tools. Like a ruler for drawing straight lines, or a dictionary for looking up words, AI chatbots are tools, only far more versatile ones.

One of the biggest problems, of course, is that AI chatbots don't always get their facts right (in AI parlance, they 'hallucinate'). So if you ask one for an essay on the 'fastest marine mammal', there's a chance it'll include references to the 'sailfish' and the 'peregrine falcon', neither of which is a mammal. In one test of AI chatbots, hallucination rates for newer AI systems were as high as 79%. Even OpenAI, the company behind ChatGPT, isn't immune. Its o3 release hallucinated 33% of the time on its PersonQA benchmark test, which measures how well a model answers questions about public figures. The new o4-mini performed even worse, hallucinating 48% of the time.

There are ways to work around this, but I think most people don't know them. For example, many chatbots now have a 'Deep Research' mode that actively searches the internet and presents answers along with sources. The point is that you, the reasonable, competent, and capable human being, can check the original source to see if it's something you trust. Instead of the machine telling you what it 'knows', it tells you what it found, and it's up to you to verify it.

Another method is to feed the chatbot the materials you want it to use, like a PDF of your textbook or a research paper. Google's NotebookLM is designed for this. It works only with the data you supply, drastically reducing hallucinations. You can then be more sure of the information it produces. In one stroke, you've turned the chatbot into a hyper-intelligent search engine that not only finds what you're looking for but also understands context, identifies patterns, and helps organise the information.
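To make that grounding idea concrete, here is a minimal sketch of how a 'use only my materials' request can be assembled before it is sent to a chatbot. The excerpt text and the ask_llm() function are hypothetical placeholders, not NotebookLM's or any vendor's actual API; the point is only that the model is told to answer from the supplied excerpts and to cite them, so a human can check every claim against its source.

```python
# Minimal sketch: ground a chatbot answer in excerpts you supply yourself.
# ask_llm() is a hypothetical stand-in for whatever chat API you actually use.

def build_grounded_prompt(question: str, excerpts: list[str]) -> str:
    """Assemble a prompt that restricts the model to the supplied excerpts."""
    numbered = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(excerpts))
    return (
        "Answer the question using ONLY the numbered excerpts below.\n"
        "Cite the excerpt number for every claim, and say 'not in the excerpts' "
        "if the answer is not there.\n\n"
        f"Excerpts:\n{numbered}\n\nQuestion: {question}"
    )

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real chat-completion call."""
    raise NotImplementedError("plug in your chatbot API here")

if __name__ == "__main__":
    excerpts = [
        # Illustrative snippets standing in for pages from your own textbook or paper.
        "Common dolphins can reach burst speeds of around 60 km/h.",
        "The sei whale has been recorded swimming at roughly 50 km/h.",
    ]
    prompt = build_grounded_prompt("What is the fastest marine mammal?", excerpts)
    print(prompt)  # the required citations let you trace each claim back to a source
```

Because every claim has to point back at a numbered excerpt, checking the output becomes the quick job described above: open the source and decide whether you trust it.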
That's just a small part of what AI can do. But even just helping students find and organise information better is a huge win.

Ideally, teachers should lead the charge in classrooms, guiding students on how to work with AI responsibly and effectively. Instead, many feel compelled to ban it or to try to 'AI-proof' assignments, for example by demanding handwritten submissions or choosing topics that chatbots are more likely to hallucinate on.

But we can do better. We should allow AI in and teach students how to use it in a way that makes them better. For example, teachers could say that the 'slop' AI generates is the bare minimum. Hand it in as-is, and you'll scrape a C or D. But if you use it to refine your thoughts, to polish your voice, to spark better ideas, then that's where the value lies. And students can use it to help them revise, by getting it to generate quizzes to test themselves with (they, of course, have to verify that the answers the AI gives are correct).

Nevertheless, what I've written about so far is about using AI as a tool. The future is about using it as a collaborator. Right now, according to the 2025 Microsoft Work Trend Index, 50% of Malaysian workers see AI as a command-based tool, while 48% treat it as a thought partner. The former issue basic instructions; the latter have conversations, and that is where human-machine collaboration begins. The report goes on to say explicitly that this kind of partnership is what all employees should strive for when working with AI. That means knowing how to iterate on the output given, when to delegate, when to refine the results, and when to push back. In short: the same skills we want kids to learn anyway when working with classmates and teachers.

And the truth is that while I've used AI to find data, summarise reports, and, yes, to proofread this article, I haven't yet actively collaborated with AI. However, the future seems to be heading in that direction. Just a few weeks ago, I wrote about mathematician Terence Tao, who predicts that computer proof assistants powered by AI may soon be cited as co-authors on mathematics papers.

Clearly, I still have a lot to learn about using AI day-to-day. And it's hard. It involves trial and error and wasted effort while battling looming deadlines. I may deliver inferior work in the meantime that collaborators have to rewrite. But I remain, as ever, optimistic. Because technology, whether as a tool or a slightly eccentric collaborator, ultimately has the potential to make us and our work better.

Logic is the antithesis of emotion, but mathematician-turned-scriptwriter Dzof Azmi's theory is that people need both to make sense of life's vagaries and contradictions. Write to Dzof at lifestyle@ The views expressed here are entirely the writer's own.

OpenAI's o3 and o4-mini hallucinate way higher than previous models

Yahoo

20-05-2025

  • Yahoo

OpenAI's o3 and o4-mini hallucinate way higher than previous models

By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly more than o1. First reported by TechCrunch, OpenAI's system card detailed the results of the PersonQA evaluation, which is designed to test for hallucinations. In that evaluation, o3's hallucination rate is 33 percent, and o4-mini's is 48 percent, almost half the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.

The system card noted that o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, simply saying, "More research is needed to understand the cause of this result."

OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."

However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.

In a statement to Mashable, an OpenAI spokesperson said, "Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability."

Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws in their datasets and even in how they evaluate models. Plus, some rely on different benchmarks and methods to test accuracy and hallucinations. HuggingFace's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents, and it found much lower hallucination rates across the board for major models on the market than OpenAI's evaluations did: GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning 0.8 percent. It's worth noting that o3 and o4-mini weren't included in the current leaderboard.

That's all to say: even industry-standard benchmarks make it difficult to assess hallucination rates. Then there's the added complexity that models tend to be more accurate when tapping into web search to source their answers. But in order to use ChatGPT search, OpenAI shares data with third-party search providers, and Enterprise customers using OpenAI models internally might not be willing to expose their prompts to that.

Regardless, if OpenAI is saying its brand-new o3 and o4-mini models hallucinate more than its non-reasoning models, that might be a problem for its users.

UPDATE: Apr. 21, 2025, 1:16 p.m. EDT: This story has been updated with a statement from OpenAI.
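The system card's point that o3 "makes more claims overall" can be made concrete with a little arithmetic. The sketch below uses invented, purely illustrative numbers (not OpenAI's actual evaluation data) to show how a model that asserts more per answer can produce both more correct claims and more hallucinated ones, which is why a per-question hallucination rate can rise even as useful output grows.

```python
# Toy arithmetic only: the claim counts and precision figures below are
# illustrative assumptions, not taken from OpenAI's PersonQA results.

def claim_counts(claims_per_answer: float, precision: float, questions: int = 100):
    """Return (accurate_claims, hallucinated_claims) over a batch of questions."""
    total = claims_per_answer * questions
    accurate = total * precision
    hallucinated = total * (1 - precision)
    return accurate, hallucinated

# A cautious model: fewer claims per answer, slightly higher precision.
cautious = claim_counts(claims_per_answer=5, precision=0.90)
# A chattier reasoning model: more claims per answer, slightly lower precision.
chatty = claim_counts(claims_per_answer=12, precision=0.80)

print("cautious  accurate=%.0f  hallucinated=%.0f" % cautious)  # 450 / 50
print("chatty    accurate=%.0f  hallucinated=%.0f" % chatty)    # 960 / 240
# The chattier model yields more accurate claims AND more hallucinated ones, so a
# benchmark that flags any answer containing a hallucination will score it worse.
```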

OpenAI's latest AI models report high 'hallucination' rate: What does it mean — and why is this significant?

Indian Express

15-05-2025

  • Indian Express

OpenAI's latest AI models report high 'hallucination' rate: What does it mean — and why is this significant?

A technical report released by artificial intelligence (AI) research organisation OpenAI last month found that the company's latest models — o3 and o4-mini — generate more errors than its older models. Computer scientists call the errors made by chatbots 'hallucinations'.

The report revealed that o3 — OpenAI's most powerful system — hallucinated 33% of the time when running its PersonQA benchmark test, which involves answering questions about public figures. The o4-mini hallucinated at a rate of 48%. To make matters worse, OpenAI said it does not even know why these models are hallucinating more than their predecessors.

Here is a look at what AI hallucinations are, why they happen, and why the new report about OpenAI's models is significant.

When the term 'AI hallucinations' began to be used to refer to errors made by chatbots, it had a very narrow definition: it referred to those instances when AI models gave fabricated information as output. For instance, in June 2023, a lawyer in the United States admitted to using ChatGPT to help write a court filing after the chatbot added fake citations to the submission, pointing to cases that never existed. Today, hallucination has become a blanket term for various types of mistakes made by chatbots, including instances when the output is factually correct but not actually relevant to the question that was asked.

ChatGPT, o3, o4-mini, Gemini, Perplexity, Grok and many more are all examples of what are known as large language models (LLMs). These models take in text inputs and generate synthesised outputs in the form of text. LLMs are able to do this because they are built using massive amounts of digital text taken from the Internet. Simply put, computer scientists feed these models a lot of text, helping them identify patterns and relationships within that text, predict text sequences, and produce some output in response to a user's input (known as a prompt).

Note that LLMs are always making a guess when giving an output. They do not know for sure what is true and what is not — these models cannot even fact-check their output against, say, Wikipedia the way humans can. LLMs 'know what words are and they know which words predict which other words in the context of words. They know what kinds of words cluster together in what order. And that's pretty much it. They don't operate like you and me,' scientist Gary Marcus wrote on his Substack, Marcus on AI.

As a result, when an LLM is trained on inaccurate text, it gives inaccurate outputs, thereby hallucinating. However, even accurate training text cannot stop LLMs from making mistakes. That's because, to generate new text in response to a prompt, these models combine billions of patterns in unexpected ways. So there is always a possibility that LLMs give fabricated information as output. And because LLMs are trained on vast amounts of data, experts do not understand why they generate a particular sequence of text at a given moment.
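A toy sketch of the next-word guessing described above helps show why fabrication is baked in. The table below is a deliberately tiny, invented bigram model, not a real LLM; it simply picks a statistically plausible next word given the previous one, with no notion of whether the resulting sentence is true.

```python
import random

# A deliberately tiny, invented "language model": for each word, a list of
# plausible next words with probabilities a real LLM would learn from text.
NEXT_WORD = {
    "the":     [("fastest", 0.5), ("common", 0.5)],
    "fastest": [("marine", 1.0)],
    "marine":  [("mammal", 0.6), ("predator", 0.4)],
    "mammal":  [("is", 1.0)],
    "is":      [("the", 1.0)],
    "common":  [("dolphin", 0.5), ("sailfish", 0.5)],  # 'sailfish' is plausible text, but not a mammal
}

def generate(start: str, length: int = 8) -> str:
    """Repeatedly sample a plausible next word; truth never enters the loop."""
    words = [start]
    for _ in range(length):
        options = NEXT_WORD.get(words[-1])
        if not options:
            break
        choices, weights = zip(*options)
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))
# e.g. "the fastest marine mammal is the common sailfish" -- fluent, confidently wrong
```

Real models work over vast vocabularies and much richer context, but the core loop is the same: pick a statistically plausible continuation, not a verified fact.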
Hallucination has been an issue with AI models from the start, and big AI companies and labs claimed repeatedly in the early years that the problem would be resolved in the near future. It did seem possible: after they were first launched, models tended to hallucinate less with each update. However, after the release of the new report about OpenAI's latest models, it has become increasingly clear that hallucination is here to stay.

The issue is also not limited to OpenAI. Other reports have shown that Chinese startup DeepSeek's R-1 model has double-digit rises in hallucination rates compared with previous models from the company.

This means that the application of AI models has to be limited, at least for now. They cannot be used, for example, as a research assistant (as models create fake citations in research papers) or a paralegal-bot (because models cite imaginary legal cases).

Computer scientists like Arvind Narayanan, a professor at Princeton University, think that, to some extent, hallucination is intrinsic to the way LLMs work, and that as these models become more capable, people will use them for tougher tasks where the failure rate will be high. In a 2024 interview with Time magazine, he said: 'There is always going to be a boundary between what people want to use them [LLMs] for, and what they can work reliably at… That is as much a sociological problem as it is a technical problem. And I do not think it has a clean technical solution.'

AI hallucination puts firms at risk? New insurance covers legal costs

Business Standard

12-05-2025

  • Business
  • Business Standard

AI hallucination puts firms at risk? New insurance covers legal costs

Insurers at Lloyd's of London have introduced a new insurance product designed to protect businesses from financial losses arising from artificial intelligence system failures, according to a report by The Financial Times.

The insurance, developed by Y Combinator-backed start-up Armilla, provides coverage for legal claims against companies when AI tools generate inaccurate outputs. The policy offers financial protection against potential legal consequences, including court-awarded damages and associated legal expenses. It responds to rising concerns over AI's tendency to produce unreliable or misleading information, commonly referred to as "hallucinations" in AI terminology.

As companies increasingly integrate AI tools to enhance efficiency, they also face growing risks from errors caused by flaws in AI models that lead to hallucinations or fabricated information. Last year, a tribunal ruled that Air Canada must honour a discount its customer service chatbot had wrongly offered.

What is an AI hallucination?

An AI hallucination occurs when an algorithm generates information that appears credible but is actually false or misleading. Computer scientists use the term to describe such errors, which have been seen in various AI tools. These hallucinations can cause significant problems when AI is used in sensitive areas. While some errors are relatively harmless, such as a chatbot giving a wrong answer, others can have serious consequences. In high-stakes settings like legal cases or health insurance decisions, inaccuracies can severely impact people's lives.

Unlike systems that follow strict, human-defined rules, AI models operate based on statistical patterns and probabilities, which makes occasional errors inevitable. Though minor mistakes may not pose a big problem for most users, hallucinations become critical when dealing with legal, medical, or confidential business matters.

Karthik Ramakrishnan, Armilla's chief executive, said the new product could encourage more companies to adopt AI by addressing fears that tools like chatbots might break down or make errors.

Hallucinations getting worse despite AI advances

Despite improvements by companies like OpenAI and Google in reducing hallucination rates, the problem has worsened with the introduction of newer reasoning models. OpenAI's internal assessments found that its latest models hallucinate more often than earlier versions. Specifically, OpenAI reported that its most advanced model, o3, produced hallucinations 33 per cent of the time on the PersonQA benchmark, which tests the ability to answer questions about public figures — more than double the rate of its earlier model, o1.

OpenAI's new reasoning AI models hallucinate more

Yahoo

18-04-2025

  • Yahoo

OpenAI's new reasoning AI models hallucinate more

OpenAI's recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up — in fact, they hallucinate more than several of OpenAI's older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, impacting even today's best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn't seem to be the case for o3 and o4-mini.

According to OpenAI's internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company's previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI's traditional, "non-reasoning" models, such as GPT-4o. Perhaps more concerning, the ChatGPT maker doesn't really know why it's happening.

In its technical report for o3 and o4-mini, OpenAI writes that "more research is needed" to understand why hallucinations are getting worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including tasks related to coding and math. But because they "make more claims overall," they're often led to make "more accurate claims as well as more inaccurate/hallucinated claims," per the report.

OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company's in-house benchmark for measuring the accuracy of a model's knowledge about people. That's roughly double the hallucination rate of OpenAI's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. O4-mini did even worse on PersonQA — hallucinating 48% of the time.

Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro "outside of ChatGPT," then copied the numbers into its answer. While o3 has access to some tools, it can't do that.

"Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines," said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email to TechCrunch. Sarah Schwettmann, co-founder of Transluce, added that o3's hallucination rate may make it less useful than it otherwise would be.

Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in their coding workflows, and that they've found it to be a step above the competition. However, Katanforoosh says that o3 tends to hallucinate broken website links: the model will supply a link that, when clicked, doesn't work.

Hallucinations may help models arrive at interesting ideas and be creative in their "thinking," but they also make some models a tough sell for businesses in markets where accuracy is paramount. For example, a law firm likely wouldn't be pleased with a model that inserts lots of factual errors into client contracts.

One promising approach to boosting the accuracy of models is giving them web search capabilities. OpenAI's GPT-4o with web search achieves 90% accuracy on SimpleQA. Potentially, search could improve reasoning models' hallucination rates as well — at least in cases where users are willing to expose prompts to a third-party search provider.
If scaling up reasoning models indeed continues to worsen hallucinations, it'll make the hunt for a solution all the more urgent. "Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability," said OpenAI spokesperson Niko Felix in an email to TechCrunch.

In the last year, the broader AI industry has pivoted to focus on reasoning models after techniques to improve traditional AI models started showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet it seems reasoning also may lead to more hallucinating — presenting a challenge.

This article originally appeared on TechCrunch.
