Benchmarks in medicine: the promise and pitfalls of evaluating AI tools with mismatched yardsticks

12-06-2025

In May 2024, OpenAI released HealthBench, a new benchmarking system to test the clinical capabilities of large language models (LLMs) such as ChatGPT. On the surface, this may sound like yet another technical update. But for the medical world, it marked an important moment—a quiet acknowledgement that our current ways of evaluating medical AI are fundamentally wrong.
Headlines in the recent past have trumpeted that AI 'outperforms doctors' or 'aces medical exams.' The impression that's coming through is these models are smarter, faster, and perhaps even safer. But this hype masks a deeper truth. To put it plainly, the benchmarks used to arrive at these claims are based on exams built for evaluating human memory retention from classroom teachings. They reward fact recall, not clinical judgment.
A calculator problem
A calculator can multiply two six-digit numbers within seconds. Impressive, no doubt. But does this mean calculators are better than, and understand maths more than mathematics experts ? Or better even than an ordinary person who takes a few minutes to do the calculation with a pen and paper?
Language models are celebrated because they can churn out textbook-style answers to MCQs and fill in the blanks for medical facts and questions faster than medical professors. But the practice of medicine is not a quiz. Real doctors deal with ambiguity, emotion, and decision-making under uncertainty. They listen, observe, and adapt.
The irony is that while AI beats doctors in answering questions, it still struggles to generate the very case vignettes that form the basis of those questions. Writing a good clinical scenario from real patients in clinical practice requires understanding human suffering, filtering irrelevant details, and framing the diagnostic dilemma with context. So far, that remains a deeply human ability.
Also Read: Why AI in healthcare needs stringent safety protocols
What existing benchmarks miss
Most widely-used benchmarks—MedQA, PubMedQA, MultiMedQA—pose structured questions with one 'correct' answer or have fill in the blanks questions. They evaluate factual accuracy but overlook human nuance. A patient doesn't say, 'I have been using a faulty chair and sitting in the wrong posture for long hours and have a non-specific backache ever since I bought it. So please choose the best diagnosis and give appropriate treatment.' They just say, 'Doctor, I'm tired. I don't feel like myself.' That is where the real work begins.
Clinical environments are messy. Doctors deal with overlapping illnesses, vague symptoms, incomplete notes, and patients who may be unable—or unwilling—to tell the full story. Communication gaps, emotional distress, and even socio-cultural factors influence how care unfolds. And yet, our evaluation metrics continue to look for precision, clarity, and correctness—things that the real world rarely provides.
Benchmarking vs reality
It can be easy to decide who the best batter in the world is, by only counting runs scored. Similarly, bowlers can be ranked by the number of wickets taken. But answering the question 'Who is the best fielder?' might not be as simple. Measuring fielding is very subjective and evades simple numbers. The number of runs outs assisted or catches taken only tells part of the story. The efforts made at the boundary line to reduce runs or mere intimidation through the presence of the fielders (like Jonty Rhodes or R. Jadeja) preventing runs at covers or points can't be measured easily.
Healthcare is like fielding: it is qualitative, often invisible, deeply contextual, and hard to quantify. Any benchmark that pretends otherwise will mislead more than it illuminates.
This is not a new problem. In 1946, the civil servant Sir Joseph Bhore, when consulted to reform India's healthcare said, 'If it were possible to evaluate the loss, which this country annually suffers through the avoidable waste of valuable human material and the lowering of human efficiency through malnutrition and preventable morbidity, we feel that the result would be so startling that the whole country would be aroused and would not rest until a radical change had been brought about'. This quote reflects a longstanding dilemma—how to measure what truly matters in health systems. Even after 80 years, we have not found perfect evaluation metrics.
What HealthBench does
HealthBench at least acknowledges this disconnect. Developed by OpenAI in collaboration with clinicians, it moves away from traditional multiple-choice formats. It is also the first benchmark to explicitly score responses using 48,562 unique rubric criteriaranging from minus 10 to plus 10, reflecting some aspects of real-world stakes of clinical decision-making. A dangerously wrong answer must be punished more harshly than a mildly useful one. This, finally, mirrors medicine's moral landscape.
Even so, HealthBench has limitations. It evaluates performance across just 5,000 'simulated' clinical cases, of which only 1,000 are classified as 'difficult.' That is a vanishingly small slice of clinical complexity. Though commendably global, its doctor-rater pool includes just 262 physicians from 60 countries in 52 languages, with varying professional experience and cultural backgrounds (three Physicians from India participated, and simulations from 11 Indian languages were generated). HealthBench Hard, a challenging subset of 1,000 cases, revealed that many existing models scored zero—highlighting their inability to handle complex clinical reasoning. Moreover, these cases are still simulations. Thus, the benchmark is an improvement, not a revolution.
Also Read: Artificial Intelligence in healthcare: what lies ahead
Predictive AI's collapse in the real world
This is not just about LLMs. Predictive models have faced similar failures. The sepsis prediction tool, developed by EPIC to flag early signs of sepsis, showed initial promise a few years ago. However, once deployed, it could not meaningfully improve outcomes. Another company that claimed to have developed a detection algorithm for liver transplantation recipients folded quietly after its model showed bias against young patients in Britain. It failed in the real world despite glowing performances on benchmark datasets. Why? Because predicting rare/critical events requires context-aware decision-making. A seemingly unknown determinant may lead to wrong predictions and unnecessary ICU admissions. The cost of error is high—and humans often bear it.
What makes a good benchmark?
A robust medical benchmark should meet four criteria:
Represent reality: Include incomplete records, contradictory symptoms, and noisy environments.
Test communication: Measure how well a model explains its reasoning, not just what answer it gives.
Handle edge cases: Evaluate performance on rare, ethically complex, or emotionally charged scenarios.
Reward safety over certainty: Penalise overconfident wrong answers more than humble uncertainty.
Currently, most benchmarks miss these criteria. And without these elements, we risk trusting technically smart but clinically naïve models.
Red teaming the models
One way forward is red teaming—a method borrowed from cybersecurity, where systems are tested against ambiguous, edge-case, or morally complex scenarios. For example: a patient in mental distress whose symptoms may be somatic; an undocumented illegal immigrant fearful of disclosing travel history; a child with vague neurological symptoms and an anxious parent pushing for a CT scan; a pregnant woman with religious objections to blood transfusion; a terminal cancer patient is unsure whether to pursue aggressive treatment or palliative care; a patient feigning for personal gain.
In these edge cases, models must go beyond knowledge. They must display judgment—or, at the very least, know when they don't know. Red teaming does not replace benchmarks. But it adds a deeper layer, exposing overconfidence, unsafe logic, or lack of cultural sensitivity. These flaws matter more than ticking the right answer box in real-world medicine. Red teaming forces models to reveal what they know and how they think. It uncovers these aspects, which may be hidden in benchmark scores.
Why this matters
The core tension is this: medicine is not just about getting answers right. It is about getting people right. Doctors are trained to deal with doubts, handle exceptions, and recognise cultural patterns not taught in books (doctors also miss a lot). AI, by contrast, is only as good as the data it has seen and the questions it has been trained on. HealthBench, for all its flaws, is a small but vital course correction. It recognises that evaluation needs to change. It introduces a better scoring rubric. It asks harder questions. That makes it better. But we must remain cautious. Healthcare is not like image recognition or language translation. A single incorrect model output can mean a lost life and a ripple effect—misdiagnoses, lawsuits, data breaches, and even health crises. In the age of data poisoning and model hallucination, the stakes are existential.
The road ahead
We must stop asking if AI is better than doctors. That is not the right question. Instead, we should ask: Where is AI safe, useful, and ethical to deploy—and where is it not? Benchmarks, if thoughtfully redesigned, can help answer that. AI in healthcare is not a competition to win. It is a responsibility to share. We must stop treating model performance as a leaderboard sport and start thinking of it as a safety checklist. Until then, AI can assist. It can summarise. It can remind. However, it cannot replace clinical judgment's moral and emotional weight. It certainly cannot sit beside a dying patient and know when to speak and when to stay silent.
(Dr. C. Aravinda is an academic and public health physician. The views expressed are personal. aravindaaiimsjr10@hotmail.com)

Hashtags

Try Our AI Features

Explore what Daily8 AI can do for you:

Comments

No comments yet...

Lenovo charts hybrid course in a cloud market ruled by titans

The Hindu

3 minutes ago

The Hindu

Lenovo charts hybrid course in a cloud market ruled by titans

In a global cloud infrastructure market valued at over $330 billion and dominated by a handful of hyperscalers, Lenovo is forging a distinct and strategic path. Rather than clashing head-on with giants like Amazon Web Services, Microsoft, and Google Cloud, the company's Infrastructure Solutions Group (ISG) is positioning itself as the indispensable foundational partner for a new era of hybrid, AI-driven computing, according to Amit Luthra, Managing Director for Lenovo ISG in India. 'Cloud is the form of the way IT is consumed,' Mr. Luthra said. 'Our strategy is to work with the complete ecosystem of the cloud—from hyperscalers and local service providers to large enterprises and SMBs.' This philosophy of collaboration over direct competition hinges on providing flexible, outcome-driven infrastructure. Mr. Luthra emphasised that the company has moved beyond simply selling hardware. The focus now is on delivering tangible results, a shift that requires a deep understanding of diverse customer needs and an open, interoperable architecture. 'We don't want to go back to our customers and say you have to let go of what you already have,' he explained, highlighting a core principle of avoiding proprietary lock-in. This allows organisations to build upon their existing IT investments, whether they are traditional on-premise systems or complex multi-cloud environments. This flexibility is particularly crucial when addressing the full spectrum of the market, from sprawling enterprises to agile small and medium-sized businesses (SMBs). Mr. Luthra pointed to a 'build vs. buy' dynamic, where many SMBs lack the resources for custom development. 'Most of the mid-market and SMB customers in India will want to buy IT — buy it like a package,' he noted. To serve this segment, Lenovo has cultivated a robust ecosystem of Independent Software Vendors (ISVs), enabling the delivery of pre-validated, turnkey solutions. 'We do have lot many solutions like these... to help them give a complete end-to-end packet solution,' Mr. Luthra added, citing examples like a 'ChatGPT in a box' for enterprises seeking a private, secure large language model. The catalyst for this market evolution, Mr. Luthra asserts, has been the explosion of artificial intelligence. While AI and analytics have been around for over a decade, the recent wave of generative models has fundamentally altered the landscape. 'GenAI took it to the mainstream. It became a boardroom imperative,' he stated. According to a global CIO survey conducted by Lenovo, India has emerged as a leader in AI adoption within the Asia-Pacific region. Indian organisations are leveraging the technology for both optimising existing processes — what Mr. Luthra calls improving 'business hygiene'—and driving disruptive innovation. Looking ahead, he believes the industry's focus is already shifting from the initial frenzy of training massive models to applying them for practical use. 'The big bet that I would have is on inferencing,' Mr. Luthra declared. 'The era of training and LLMs was a two years back story... while that is still on, inferencing is where the big bet is.' With this rapid technological advance, however, comes the critical responsibility of governance. Mr. Luthra said that security, privacy, and ethics are now central to every major technology discussion. 'Responsible AI is a very, very big discussion point in deploying most of these solutions,' he said. 'People want to talk about security, people want to talk about ethics. This is where IT makes the excitement in the board real.'

Musk's bid to dismiss OpenAI's harassment claims denied in court

Hindustan Times

4 hours ago

Hindustan Times

Musk's bid to dismiss OpenAI's harassment claims denied in court

A federal judge on Tuesday denied Elon Musk's bid to dismiss OpenAI's claims of a "years-long harassment campaign" by the Tesla CEO against the company he co-founded in 2015 and later abandoned before ChatGPT became a global phenomenon. Musk sued OpenAI and its CEO Sam Altman last year over the company's transition to a for-profit model.(REUTERS) In the latest turn in a court battle that kicked off last year, US District Judge Yvonne Gonzalez Rogers ruled that Musk must face OpenAI's claims that the billionaire, through press statements, social media posts, legal claims and "a sham bid for OpenAI's assets" had attempted to harm the AI startup. Musk sued OpenAI and its CEO Sam Altman last year over the company's transition to a for-profit model, accusing the company of straying from its founding mission of developing AI for the good of humanity, not profit. OpenAI countersued Musk in April, accusing the billionaire of engaging in fraudulent business practices under California law. Musk then asked for OpenAI's counterclaims to be dismissed or delayed until a later stage in the case. OpenAI argued in May its countersuit should not be put on hold, and the judge on Tuesday concluded that the company's allegations were legally sufficient to proceed. A jury trial has been scheduled for spring 2026.

High school maths trumps Olympiad gold medalist AI models: Google Deepmind CEO answers why

Economic Times

7 hours ago

Economic Times

High school maths trumps Olympiad gold medalist AI models: Google Deepmind CEO answers why

Google Deepmind chief executive Demis Hassabis said that advanced AI models like Gemini can surpass benchmarks like the International Mathematical Olympiad (IMO) but struggle with basic high school maths problems due to inconsistencies. "The lack of consistency in AI is a major barrier to achieving artificial general intelligence (AGI), " he said on the "Google for Developers" podcast, adding that it is a major roadblock in the journey. Artificial general intelligence, or AGI, is generally understood as software that has the general cognitive abilities of human beings and can perform any task that a human can. He also referred to Google CEO Sundar Pichai's description of the current state of AI as "AJI", or artificial jagged intelligence, where systems excel in certain tasks but fail in others. Road towards AGI The Deepmind CEO said just increasing data and computing power won't suffice to solve the problem at highlighted that rigorous testing and challenging benchmarks can precisely measure an AI model's accurate progress."We need better testing and new, more challenging benchmarks to determine precisely what the models excel at and what they don't." Also Read: AI helps Big Tech score big numbers Not just Google ET reported that artificial intelligence (AI) agents, hailed as the "next big thing" by major tech players like Google, OpenAI, and Anthropic, are expected to be a major focus and trend this year. OpenAI launched Operator, its first AI agent, in January this year, for Pro users across multiple regions, including Australia, Brazil, Canada, India, Japan, Singapore, South Korea, the UK, and most places where ChatGPT is October, Anthropic launched an upgraded version of its Claude 3.5 Sonnet model, which can interact with any desktop application. This AI agent can perform desktop-level commands and browse the web to complete tasks. Also Read: ETtech Explainer | Artificial general intelligence: an enabler or a destroyer