
The methodology to judge AI needs realignment
When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said the models set 'new standards for coding, advanced reasoning, and AI agents', citing leading scores on SWE-bench Verified, a benchmark for performance on real-world software engineering tasks. OpenAI likewise claims its o3 and o4-mini models deliver the best scores on certain benchmarks, as does Mistral for its open-source Devstral coding model.
AI companies flexing comparative test scores is a common theme.
The world of technology has long obsessed over synthetic benchmark test scores. Processor performance, memory bandwidth, storage speed, graphics performance: plentiful examples, often used to judge whether a PC or a smartphone was worth your time and money.
Yet, experts believe it may be time to evolve the methodology for AI testing, rather than attempt a wholesale change.
American venture capitalist Mary Meeker, in her latest AI Trends report, notes that AI is increasingly outperforming humans on accuracy and realism. She points to the MMLU (Massive Multitask Language Understanding) benchmark, on which AI models now average 92.30% accuracy, compared with a human baseline of 89.8%.
MMLU is a benchmark to judge a model's general knowledge across 57 tasks covering professional and academic subjects including math, law, medicine and history.
Benchmarks serve as standardised yardsticks to measure, compare, and track the evolution of different AI models: structured assessments that provide comparable scores across models. They typically consist of datasets containing thousands of curated questions, problems, or tasks that test particular aspects of intelligence.
Understanding benchmark scores requires context about both the scale and the meaning behind the numbers. Most benchmarks report accuracy as a percentage, but the significance of these percentages varies dramatically across tests. On MMLU, random guessing would yield approximately 25% accuracy, since most questions are four-option multiple choice. Human performance typically ranges from 85% to 95%, depending on subject area.
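To make that arithmetic concrete, here is a minimal Python sketch, not any lab's actual evaluation harness; the items, subjects and answer labels are invented placeholders. It shows how accuracy on a four-option multiple-choice benchmark is typically scored, and why random guessing bottoms out around 25%:

```python
import random

# A toy four-option multiple-choice benchmark. The items and subjects here are
# placeholders, not real MMLU questions.
benchmark = [
    {"subject": "law", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"subject": "math", "options": ["A", "B", "C", "D"], "answer": "D"},
    {"subject": "medicine", "options": ["A", "B", "C", "D"], "answer": "A"},
    {"subject": "history", "options": ["A", "B", "C", "D"], "answer": "C"},
]

def accuracy(predictions, items):
    """Fraction of items where the predicted option matches the labelled answer."""
    correct = sum(pred == item["answer"] for pred, item in zip(predictions, items))
    return correct / len(items)

# Random guessing picks one of four options, so over many items it converges
# on roughly 1/4 = 25% accuracy, the floor against which reported scores are read.
random_predictions = [random.choice(item["options"]) for item in benchmark]
print(f"random-guess accuracy: {accuracy(random_predictions, benchmark):.0%}")
```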
Headline numbers often mask important nuances. A model might excel in certain subjects more than others, and an aggregated score can hide weaker performance on tasks requiring multi-step reasoning or creative problem-solving behind strong performance on factual recall.
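A small, hypothetical illustration of that masking effect: the per-item results below are invented, but they show how a comfortable aggregate score can sit on top of a much weaker category.

```python
from collections import defaultdict

# Hypothetical per-item results for one model: (task category, answered correctly).
results = [
    ("factual recall", True), ("factual recall", True), ("factual recall", True),
    ("factual recall", True), ("factual recall", True),
    ("multi-step reasoning", True), ("multi-step reasoning", False),
    ("multi-step reasoning", False),
]

# The single aggregate number looks healthy...
overall = sum(ok for _, ok in results) / len(results)
print(f"aggregate accuracy: {overall:.0%}")  # 75%

# ...but grouping by category exposes the weaker skill it was averaging over.
by_category = defaultdict(list)
for category, ok in results:
    by_category[category].append(ok)

for category, outcomes in by_category.items():
    print(f"{category}: {sum(outcomes) / len(outcomes):.0%}")
# factual recall: 100%, multi-step reasoning: 33%
```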
AI engineer and commentator Rohan Paul notes on X that 'most benchmarks don't reward long-term memory, rather they focus on short-context tasks.'
Increasingly, AI companies are looking closely at the 'memory' aspect. Researchers at Google, in a new paper, detail an attention technique dubbed 'Infini-attention' that lets AI models extend their 'context window', the amount of text a model can consider at once.
Mathematical benchmarks often show wider performance gaps. Most of the latest AI models score over 90% on the GSM8K benchmark (Claude 3.5 Sonnet leads with 97.72%, while GPT-4 scores 94.8%), but the more challenging MATH benchmark sees much lower scores in comparison: Google's Gemini 2.0 Flash Experimental leads with 89.7%, GPT-4 scores 84.3%, and Sonnet has not yet been tested.
Reworking the methodology
For AI testing, there is a need to realign testbeds. 'All the evals are saturated. It's becoming slightly meaningless,' said Satya Nadella, chairman and chief executive officer (CEO) of Microsoft, speaking at venture capital firm Madrona's annual meeting earlier this year.
The tech giant has announced it is collaborating with institutions including Penn State University, Carnegie Mellon University and Duke University to develop an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.
The attempt is to build benchmarking agents for dynamic evaluation of models, contextual predictability, human-centric comparisons and the cultural aspects of generative AI.
'The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities,' explains Lexin Zhou, Research Assistant at Microsoft.
Currently, popular benchmarks include SWE-bench (Software Engineering Benchmark) Verified, which evaluates AI coding skills; ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence), which judges generalisation and reasoning; and LiveBench AI, which measures agentic coding tasks and evaluates LLMs on reasoning, coding and math.
Among the limitations that can affect interpretation: many benchmarks can be 'gamed' through techniques that improve scores without necessarily improving intelligence or capability. Case in point: Meta's new Llama models.
In April, the company announced an array of models, including Llama 4 Scout, Llama 4 Maverick, and the still-in-training Llama 4 Behemoth. Meta CEO Mark Zuckerberg claims Behemoth will be the 'highest performing base model in the world'. Maverick began ranking above OpenAI's GPT-4o on the LMArena leaderboard, and just below Gemini 2.5 Pro.
That is where things went pear-shaped for Meta, as AI researchers began to dig through the scores. It turned out Meta had submitted a Llama 4 Maverick variant optimised for that test, not the spec customers would actually get.
Meta denies customisations. 'We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilise implementations,' says Ahmad Al-Dahle, VP of generative AI at Meta, in a statement.
There are other challenges. Models might memorise patterns specific to benchmark formats rather than developing genuine understanding. The selection and design of benchmarks also introduce bias.
Then there is the question of localisation. Yi Tay, AI researcher at Google AI and DeepMind, has detailed one such region-specific benchmark called SG-Eval, focused on helping train AI models for wider context. India, too, is building a sovereign large language model (LLM), with Bengaluru-based AI startup Sarvam selected under the IndiaAI Mission to build it.
As AI capabilities continue advancing, researchers are developing evaluation methods that test for genuine understanding, robustness across contexts and real-world capability, rather than plain pattern matching. In the case of AI, the numbers tell an important part of the story, but not the complete story.