
Why AI Benchmarking Needs A Rethink
AI models are evolving at breakneck speed, but the methods for measuring their performance remain stagnant, and the real-world consequences are significant. Models that have not been thoroughly tested can lead to inaccurate conclusions, missed opportunities and costly errors.
As AI adoption accelerates, it's becoming clear that current testing frameworks fall short in assessing real-world reasoning capabilities—pointing to an urgent need for improved evaluation standards.
The Limitations Of Current AI Benchmarks
Traditional AI benchmarks are structured to evaluate basic tasks, such as factual recall and fluency, which are easy to measure. However, advanced capabilities like causal reasoning—the ability to identify cause-and-effect relationships—are more difficult to assess systematically despite their importance in everyday AI applications.
While most benchmarks are useful in gauging an AI model's capacity to process and reproduce information, they fail to assess whether the model is truly 'reasoning' or merely recognizing patterns from its training data. Understanding this distinction is crucial because, as S&P Global research notes, AI's reasoning ability directly impacts its applicability in tasks like problem-solving, decision making and generating insights that go beyond simple data retrieval.
Additionally, the prompts used to evaluate the majority of AI capabilities are primarily in English, neglecting the diverse linguistic and cultural contexts of the global marketplace. This limitation is especially relevant as AI models are increasingly deployed around the world, where the demands for accuracy and consistency are vital across languages, as recently discussed by Stanford Assistant Professor Sanmi Koyejo.
The Multilingual Blind Spot
Most datasets used for evaluating causal reasoning are designed with English as the primary language, leaving models' abilities to reason about cause-and-effect relationships in other languages largely untested.
Languages exhibit significant diversity in their grammatical structures, morphological systems and other linguistic features. If the models have not been sufficiently exposed to these differences, their ability to identify causality can be impacted.
My company, Welo Data, conducted an independent benchmarking study of more than 20 large language models (LLMs) from 10 different developers, revealing just how significant this issue is. The evaluation used story-based prompts that required contextual reasoning to test advanced causal reasoning capabilities across languages, including English, Spanish, Japanese, Korean, Turkish and Arabic. The results: LLMs often struggled with these complex causal inference tasks, especially when tested in languages other than English.
Many models showed inconsistent results when interpreting the same logical scenario in different languages. This inconsistency suggests that these models fail to account for linguistic differences in the way humans reason and convey causality. If AI is to be useful across diverse languages, benchmarking frameworks must evolve to test language model proficiency across linguistic boundaries.
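One way to make this kind of inconsistency measurable is to pose the same scenario in every language and check how often the model's answers agree. The sketch below is a minimal illustration, not the method used in the study: the function name, the scenario IDs and the answer labels are all hypothetical, and it assumes each answer has already been mapped to a language-independent label.

```python
from collections import Counter


def consistency_by_scenario(answers):
    """Score each scenario by cross-language agreement.

    answers maps scenario_id -> {language_code: answer_label}.
    The score is the share of languages whose answer matches the
    most common answer for that scenario (1.0 = fully consistent).
    """
    scores = {}
    for scenario_id, by_language in answers.items():
        if not by_language:
            continue  # nothing to compare for this scenario
        counts = Counter(by_language.values())
        top_count = counts.most_common(1)[0][1]
        scores[scenario_id] = top_count / len(by_language)
    return scores


# Hypothetical model answers to the same two stories in four languages.
answers = {
    "story-1": {"en": "A", "es": "A", "ja": "B", "tr": "A"},
    "story-2": {"en": "C", "es": "C", "ja": "C", "tr": "C"},
}

scores = consistency_by_scenario(answers)
for scenario_id, score in scores.items():
    print(f"{scenario_id}: {score:.2f}")
```

A score below 1.0 only shows that the answers diverge across languages. It does not say which language produced the correct answer, so it complements an accuracy check rather than replacing it.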
The Causal Reasoning Gap
Causal reasoning is a crucial aspect of human intelligence, allowing us to understand not only what happens but also why it happens. The study found that many AI models still struggle with this fundamental capability, particularly in multilingual contexts. While these models excel at pattern recognition, they often fail to identify causal relationships in scenarios that require multistep reasoning.
This gap is a significant limitation when deploying these models in real-world scenarios, such as healthcare, finance or customer support, where accurate and nuanced decision making is critical. Existing benchmarks often simplify cause-and-effect scenarios to tasks that rely on well-established datasets or pre-defined solutions, making it difficult to determine whether the model is truly reasoning or simply reproducing learned patterns.
One promising direction involves using more complex, human-crafted testing scenarios designed to require genuine causal inference rather than pattern recognition. By incorporating such methods into evaluation frameworks, organizations can more clearly identify where models fall short—especially in multilingual or high-stakes applications—and take targeted steps to improve performance.
A New Way Forward: Evolving AI Testing For The Future
To truly understand how AI models will perform in functional settings, testing methodologies must assess the full range of cognitive abilities required in human-like reasoning. There are several ways companies can adopt better AI testing methodologies:
• Implement a multilingual approach. AI models must be evaluated across multiple languages to ensure they can handle the complexities of global communication. This is especially important for companies operating in diverse markets or serving international customers.
• Incorporate complex, real-world scenarios. Focus on evaluating AI through scenarios where multiple factors and variables interact, which gives a more accurate measurement of AI's capabilities than isolated, simplified tasks.
• Emphasize causal reasoning with novel data. Prioritize assessing causal reasoning abilities using previously unseen scenarios and examples that require genuine understanding of cause-and-effect relationships. This ensures the AI is demonstrating true causal inference rather than pattern matching or recalling information from its training data.
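The first and third recommendations above can be combined into a simple report: score the model on held-out scenarios in each language, then compare every language against a baseline. The following is a minimal sketch under stated assumptions. The function names, the language codes and the sample records are hypothetical, and choosing English as the baseline is an illustrative choice rather than a requirement.

```python
from collections import defaultdict


def accuracy_by_language(results):
    """Compute accuracy per language.

    results is an iterable of (language, predicted, expected) tuples,
    one per held-out scenario.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, expected in results:
        total[language] += 1
        if predicted == expected:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


def gap_to_baseline(accuracy, baseline="en"):
    """Return how far each language's accuracy falls below the baseline."""
    base = accuracy[baseline]
    return {
        language: base - value
        for language, value in accuracy.items()
        if language != baseline
    }


# Hypothetical outcomes on two unseen causal scenarios per language.
results = [
    ("en", "rain", "rain"), ("en", "strike", "strike"),
    ("es", "rain", "rain"), ("es", "delay", "strike"),
    ("ko", "delay", "rain"), ("ko", "delay", "strike"),
]

acc = accuracy_by_language(results)
gaps = gap_to_baseline(acc)
for language, gap in sorted(gaps.items()):
    print(f"{language}: {gap:.2f} below baseline")
```

Reporting the gap per language, rather than a single averaged score, keeps a strong English result from hiding weak performance elsewhere.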
Paving The Way: Building Better AI
Existing benchmarks often do not accurately assess the full range of AI's capabilities. Depending on which benchmarks are used and what a business is trying to achieve, this can leave it with incomplete or misleading information about how its AI models perform.
As AI continues to evolve, so too must the methods used to evaluate its performance.
By adopting a comprehensive, multilingual and real-world testing approach, we can ensure that AI models are not only capable but also reliable and equitable across diverse languages and contexts. It's time to rethink AI benchmarking—and, with that, the future of AI itself.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.