
The methodology to judge AI needs realignment
When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said the models set 'new standards for coding, advanced reasoning, and AI agents', citing leading scores on SWE-bench Verified, a benchmark for performance on real-world software engineering tasks. OpenAI likewise claims its o3 and o4-mini models deliver the best scores on certain benchmarks, as does Mistral for its open-source Devstral coding model.
AI companies flexing comparative test scores is a common theme.
The world of technology has long obsessed over synthetic benchmark test scores. Processor performance, memory bandwidth, storage speed, graphics performance: plentiful examples, often used to judge whether a PC or a smartphone was worth your time and money.
Yet, experts believe it may be time to evolve the methodology for AI testing, rather than attempt a wholesale change.
American venture capitalist Mary Meeker, in her latest AI Trends report, notes that AI is increasingly outperforming humans on accuracy and realism. She points to the MMLU (Massive Multitask Language Understanding) benchmark, on which AI models now average 92.30% accuracy, compared with a human baseline of 89.8%.
MMLU is a benchmark to judge a model's general knowledge across 57 tasks covering professional and academic subjects including math, law, medicine and history.
Benchmarks serve as standardised yardsticks to measure, compare, and track the evolution of different AI models: structured assessments that provide comparable scores across models. They typically consist of datasets containing thousands of curated questions, problems, or tasks that test particular aspects of intelligence.
Understanding benchmark scores requires context about both the scale and the meaning behind the numbers. Most benchmarks report accuracy as a percentage, but the significance of these percentages varies dramatically across tests. On MMLU, random guessing would yield approximately 25% accuracy, since most questions are four-option multiple choice. Human performance typically ranges from 85% to 95%, depending on subject area.
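To make that arithmetic concrete, here is a minimal Python sketch, not any lab's actual evaluation harness; the items, subjects and answer labels are invented placeholders. It shows how accuracy on a four-option multiple-choice benchmark is typically scored, and why random guessing bottoms out around 25%:

```python
import random

# A toy four-option multiple-choice benchmark. The items and subjects here are
# placeholders, not real MMLU questions.
benchmark = [
    {"subject": "law", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"subject": "math", "options": ["A", "B", "C", "D"], "answer": "D"},
    {"subject": "medicine", "options": ["A", "B", "C", "D"], "answer": "A"},
    {"subject": "history", "options": ["A", "B", "C", "D"], "answer": "C"},
]

def accuracy(predictions, items):
    """Fraction of items where the predicted option matches the labelled answer."""
    correct = sum(pred == item["answer"] for pred, item in zip(predictions, items))
    return correct / len(items)

# Random guessing picks one of four options, so over many items it converges
# on roughly 1/4 = 25% accuracy, the floor against which reported scores are read.
random_predictions = [random.choice(item["options"]) for item in benchmark]
print(f"random-guess accuracy: {accuracy(random_predictions, benchmark):.0%}")
```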
Headline numbers often mask important nuances. A model might excel in certain subjects more than others, and an aggregated score can hide weaker performance on tasks requiring multi-step reasoning or creative problem-solving behind strong performance on factual recall.
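A small, hypothetical illustration of that masking effect: the per-item results below are invented, but they show how a comfortable aggregate score can sit on top of a much weaker category.

```python
from collections import defaultdict

# Hypothetical per-item results for one model: (task category, answered correctly).
results = [
    ("factual recall", True), ("factual recall", True), ("factual recall", True),
    ("factual recall", True), ("factual recall", True),
    ("multi-step reasoning", True), ("multi-step reasoning", False),
    ("multi-step reasoning", False),
]

# The single aggregate number looks healthy...
overall = sum(ok for _, ok in results) / len(results)
print(f"aggregate accuracy: {overall:.0%}")  # 75%

# ...but grouping by category exposes the weaker skill it was averaging over.
by_category = defaultdict(list)
for category, ok in results:
    by_category[category].append(ok)

for category, outcomes in by_category.items():
    print(f"{category}: {sum(outcomes) / len(outcomes):.0%}")
# factual recall: 100%, multi-step reasoning: 33%
```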
AI engineer and commentator Rohan Paul notes on X that 'most benchmarks don't reward long-term memory, rather they focus on short-context tasks.'
Increasingly, AI companies are looking closely at the 'memory' aspect. Researchers at Google, in a new paper, detail an attention technique dubbed 'Infini-attention' that lets AI models extend their 'context window', the amount of text a model can consider at once.
Mathematical benchmarks often show wider performance gaps. Most of the latest AI models score over 90% on the GSM8K benchmark (Claude 3.5 Sonnet leads with 97.72%, while GPT-4 scores 94.8%), but the more challenging MATH benchmark sees much lower scores in comparison: Google's Gemini 2.0 Flash Experimental leads with 89.7%, GPT-4 scores 84.3%, and Sonnet has not yet been tested.
Reworking the methodology
For AI testing, there is a need to realign testbeds. 'All the evals are saturated. It's becoming slightly meaningless,' said Satya Nadella, chairman and chief executive officer (CEO) of Microsoft, speaking at venture capital firm Madrona's annual meeting earlier this year.
The tech giant has announced it is collaborating with institutions including Penn State University, Carnegie Mellon University and Duke University to develop an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.
The attempt is to build benchmarking agents for dynamic evaluation of models, contextual predictability, human-centric comparisons and the cultural aspects of generative AI.
'The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities,' explains Lexin Zhou, Research Assistant at Microsoft.
Currently, popular benchmarks include SWE-bench (Software Engineering Benchmark) Verified, which evaluates AI coding skills; ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence), which judges generalisation and reasoning; and LiveBench AI, which measures agentic coding tasks and evaluates LLMs on reasoning, coding and math.
Among the limitations that can affect interpretation: many benchmarks can be 'gamed' through techniques that improve scores without necessarily improving intelligence or capability. Case in point: Meta's new Llama models.
In April, the company announced an array of models, including Llama 4 Scout, Llama 4 Maverick, and the still-in-training Llama 4 Behemoth. Meta CEO Mark Zuckerberg claims Behemoth will be the 'highest performing base model in the world'. Maverick began ranking above OpenAI's GPT-4o on the LMArena leaderboard, and just below Gemini 2.5 Pro.
That is where things went pear-shaped for Meta, as AI researchers began to dig through the scores. It turned out Meta had submitted a Llama 4 Maverick variant optimised for that test, not the spec customers would actually get.
Meta denies customisations. 'We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilise implementations,' says Ahmad Al-Dahle, VP of generative AI at Meta, in a statement.
There are other challenges. Models might memorise patterns specific to benchmark formats rather than developing genuine understanding. The selection and design of benchmarks also introduce bias.
Then there is the question of localisation. Yi Tay, AI researcher at Google AI and DeepMind, has detailed one such region-specific benchmark called SG-Eval, focused on helping train AI models for wider context. India, too, is building a sovereign large language model (LLM), with Bengaluru-based AI startup Sarvam selected under the IndiaAI Mission to build it.
As AI capabilities continue advancing, researchers are developing evaluation methods that test for genuine understanding, robustness across contexts and real-world capability, rather than plain pattern matching. In the case of AI, the numbers tell an important part of the story, but not the complete story.