Latest news with #SWE-benchVerified

The methodology to judge AI needs realignment

Hindustan Times

4 days ago

When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said these models set 'new standards for coding, advanced reasoning, and AI agents', citing leading scores on SWE-bench Verified, a benchmark for performance on real software engineering tasks. OpenAI also claims its o3 and o4-mini models return the best scores on certain benchmarks, as does Mistral for its open-source Devstral coding model. AI companies flexing comparative test scores is a common theme.

The world of technology has long obsessed over synthetic benchmark scores. Processor performance, memory bandwidth, storage speed, graphics performance: plentiful examples, often used to judge whether a PC or a smartphone was worth your time and money. Yet experts believe it may be time to evolve the methodology for AI testing, rather than replace it wholesale.

American venture capitalist Mary Meeker, in the latest AI Trends report, notes that AI is increasingly doing better than humans in terms of accuracy and realism. She points to the MMLU (Massive Multitask Language Understanding) benchmark, on which AI models average 92.30% accuracy compared with a human baseline of 89.8%. MMLU judges a model's general knowledge across 57 tasks covering professional and academic subjects, including math, law, medicine and history.

Benchmarks serve as standardised yardsticks to measure, compare and understand the evolution of different AI models: structured assessments that provide comparable scores across models. They typically consist of datasets containing thousands of curated questions, problems or tasks that test particular aspects of intelligence.

Understanding benchmark scores requires context about both the scale and the meaning behind the numbers. Most benchmarks report accuracy as a percentage, but the significance of these percentages varies dramatically across tests. On MMLU, random guessing would yield approximately 25% accuracy, since most questions are multiple choice with four options; human performance typically ranges from 85% to 95% depending on the subject area.

Headline numbers often mask important nuances. A model might excel in certain subjects more than others, and an aggregated score may hide weaker performance on tasks requiring multi-step reasoning or creative problem-solving behind strong performance on factual recall. AI engineer and commentator Rohan Paul notes on X that 'most benchmarks don't reward long-term memory, rather they focus on short-context tasks.' Increasingly, AI companies are looking closely at the 'memory' aspect: researchers at Google, in a new paper, detail an attention technique dubbed 'Infini-attention' that lets AI models extend their 'context window'.

Mathematical benchmarks often show wider performance gaps. While most of the latest AI models score over 90% accuracy on the GSM8K benchmark (Claude 3.5 Sonnet leads with 97.72%, while GPT-4 scores 94.8%), the more challenging MATH benchmark sees much lower scores in comparison: Google's Gemini 2.0 Flash Experimental leads with 89.7%, while GPT-4 scores 84.3%; Sonnet has not been tested on it yet.

Reworking the methodology

For AI testing, there is a need to realign testbeds. 'All the evals are saturated. It's becoming slightly meaningless,' said Satya Nadella, chairman and chief executive officer (CEO) of Microsoft, speaking at venture capital firm Madrona's annual meeting earlier this year.
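To make the point about aggregate scores concrete, here is a minimal sketch with invented numbers: per-subject accuracy on a four-option multiple-choice benchmark in the style of MMLU, compared against the roughly 25% random-guess baseline. The subjects, answers and scores below are hypothetical, not real benchmark data.

```python
# Toy illustration (hypothetical data): how an aggregate benchmark score can
# hide per-subject weakness, and why the random-guess baseline matters on a
# four-option multiple-choice test such as MMLU.
from collections import defaultdict

# Each record: (subject, model_answer, correct_answer) -- invented examples.
results = [
    ("history", "B", "B"), ("history", "C", "C"), ("history", "A", "A"),
    ("law", "D", "D"), ("law", "A", "A"), ("law", "B", "B"),
    ("math_multistep", "A", "C"), ("math_multistep", "B", "D"),
    ("math_multistep", "C", "C"),
]

per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
for subject, predicted, gold in results:
    per_subject[subject][1] += 1
    if predicted == gold:
        per_subject[subject][0] += 1

total_correct = sum(c for c, _ in per_subject.values())
total = sum(t for _, t in per_subject.values())
random_baseline = 1 / 4  # ~25% for four answer options

print(f"aggregate accuracy: {total_correct / total:.1%} "
      f"(random-guess baseline {random_baseline:.0%})")
for subject, (correct, seen) in sorted(per_subject.items()):
    print(f"  {subject}: {correct / seen:.1%}")
# The aggregate looks healthy even though the multi-step maths subset is
# barely above chance -- the kind of nuance a single headline number hides.
```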
Microsoft has announced it is collaborating with institutions including Penn State University, Carnegie Mellon University and Duke University to develop an approach that evaluates AI models by predicting how they will perform on unfamiliar tasks and explaining why, something current benchmarks struggle to do. The attempt is to build benchmarking agents for dynamic evaluation of models, contextual predictability, human-centric comparisons and the cultural aspects of generative AI. 'The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities,' explains Lexin Zhou, Research Assistant at Microsoft.

Currently, popular benchmarks include SWE-bench (Software Engineering Benchmark) Verified to evaluate AI coding skills, ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) to judge generalisation and reasoning, as well as LiveBench AI, which measures agentic coding tasks and evaluates LLMs on reasoning, coding and math.

Among the limitations that can affect interpretation, many benchmarks can be 'gamed' through techniques that improve scores without necessarily improving intelligence or capability. Case in point: Meta's new Llama models. In April, the company announced an array of models, including Llama 4 Scout, Llama 4 Maverick and the still-being-trained Llama 4 Behemoth. Meta CEO Mark Zuckerberg claims Behemoth will be the 'highest performing base model in the world'. Maverick began ranking above OpenAI's GPT-4o in LMArena benchmarks, and just below Gemini 2.5 Pro. That is where things went pear-shaped for Meta, as AI researchers began to dig through these scores. It turned out Meta had submitted a Llama 4 Maverick model that was optimised for this test, not exactly the spec customers would get. Meta denies customisations. 'We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilise implementations,' says Ahmad Al-Dahle, VP of generative AI at Meta, in a statement.

There are other challenges. Models might memorise patterns specific to benchmark formats rather than developing genuine understanding, and the selection and design of benchmarks also introduces bias. There is a question of localisation, too: Yi Tay, an AI researcher at Google DeepMind, has detailed one such region-specific benchmark, SG-Eval, focused on helping train AI models for wider context. India too is building a sovereign large language model (LLM), with Bengaluru-based AI startup Sarvam selected under the IndiaAI Mission.

As AI capabilities continue advancing, researchers are developing evaluation methods that test for genuine understanding, robustness across contexts and real-world capability, rather than plain pattern matching. In the case of AI, numbers tell an important part of the story, but not the complete story.
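The demand-versus-ability idea behind frameworks such as ADeLe, described above, can be conveyed with a toy sketch. The ability names, levels and success rule below are invented for illustration and are not Microsoft's actual framework; they only show the general approach of annotating how demanding a task is and comparing that with a model's ability profile.

```python
# Toy illustration of the demand-vs-ability idea: rate each task on several
# cognitive/knowledge scales, profile a model on the same scales, and predict
# success where ability meets or exceeds demand. Scales, levels and the
# prediction rule are invented here, not taken from ADeLe itself.

# Hypothetical demand annotations for two tasks (0 = trivial, 5 = very hard).
tasks = {
    "fix_failing_unit_test": {"code_reasoning": 3, "long_context": 2, "domain_knowledge": 1},
    "prove_number_theory_lemma": {"code_reasoning": 1, "long_context": 2, "domain_knowledge": 5},
}

# Hypothetical ability profile for a model, on the same scales.
model_profile = {"code_reasoning": 4, "long_context": 3, "domain_knowledge": 2}

def predict_success(demands: dict, abilities: dict) -> bool:
    """Predict success if the model's ability covers every demanded dimension."""
    return all(abilities.get(dim, 0) >= level for dim, level in demands.items())

for name, demands in tasks.items():
    verdict = "likely to succeed" if predict_success(demands, model_profile) else "likely to fail"
    print(f"{name}: {verdict}")
```

The appeal of this style of evaluation is that the prediction explains itself: a failure can be traced to the specific ability dimension where the task's demand exceeds the model's profile, rather than to a single opaque score.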

Debates over AI benchmarking have reached Pokémon

Yahoo

14-04-2025

Not even Pokémon is safe from AI benchmarking controversy. Last week, a post on X went viral, claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention is that Gemini had an advantage: a minimap. As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, such as cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results. For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed. More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena; the vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn't seem likely that it'll get any easier to compare models as they're released. This article originally appeared on TechCrunch.
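To see why harness differences matter, here is a hypothetical sketch of the same toy "agent" evaluated with and without a scaffold that pre-digests the screen into a minimap hint. Neither variant reflects the actual Gemini or Claude Pokémon setups; the agent, the helper tool and their interfaces are invented for illustration.

```python
# Hypothetical sketch of why two "runs of the same benchmark" can diverge:
# the harness, not just the model, shapes the result. The "minimap" scaffold
# and the agent interface below are invented for illustration only.
from typing import Callable, Optional

def run_episode(agent: Callable[[str, Optional[str]], str],
                raw_observation: str,
                minimap: Optional[Callable[[str], str]] = None) -> str:
    """One decision step: the agent sees the raw screen, plus an optional
    pre-digested minimap summary if the harness provides that scaffold."""
    hint = minimap(raw_observation) if minimap else None
    return agent(raw_observation, hint)

def toy_agent(screen: str, hint: Optional[str]) -> str:
    # With a hint, the "model" can act directly; without one it must guess.
    return f"cut tree at {hint}" if hint else "wander and re-screenshot"

def toy_minimap(screen: str) -> str:
    return "tile (12, 7)"  # pretend tile annotation extracted from the screen

print(run_episode(toy_agent, "raw screenshot bytes"))               # bare harness
print(run_episode(toy_agent, "raw screenshot bytes", toy_minimap))  # scaffolded harness
```

The model code is identical in both calls; only the harness changes, which is exactly why scores reported under custom scaffolds are hard to compare with scores from a standard setup.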

DeepSeek claims its 'reasoning' model beats OpenAI's o1 on certain benchmarks

Yahoo

28-01-2025

Chinese AI lab DeepSeek has released an open version of DeepSeek-R1, its so-called reasoning model, that it claims performs as well as OpenAI's o1 on certain AI benchmarks. R1 is available from the AI dev platform Hugging Face under an MIT license, meaning it can be used commercially without restrictions.

According to DeepSeek, R1 beats o1 on the benchmarks AIME, MATH-500, and SWE-bench Verified. AIME is a set of challenging problems drawn from the American Invitational Mathematics Examination, while MATH-500 is a collection of word problems; SWE-bench Verified, meanwhile, focuses on programming tasks.

Being a reasoning model, R1 effectively fact-checks itself, which helps it avoid some of the pitfalls that normally trip up models. Reasoning models take a little longer, usually seconds to minutes, to arrive at solutions compared with a typical non-reasoning model. The upside is that they tend to be more reliable in domains such as physics, science, and math.

R1 contains 671 billion parameters, DeepSeek revealed in a technical report. Parameters roughly correspond to a model's problem-solving skills, and models with more parameters generally perform better than those with fewer. While 671 billion parameters is massive, DeepSeek has also released "distilled" versions of R1 ranging in size from 1.5 billion to 70 billion parameters; the smallest can run on a laptop. The full R1 requires beefier hardware, but it is available through DeepSeek's API at prices 90%-95% cheaper than OpenAI's o1.

Clem Delangue, the CEO of Hugging Face, said in a post on X on Monday that developers on the platform have created more than 500 "derivative" models of R1 that have racked up 2.5 million downloads combined, five times the number of downloads the official R1 has gotten.

There is a downside to R1. Being a Chinese model, it is subject to benchmarking by China's internet regulator to ensure that its responses "embody core socialist values." R1 won't answer questions about Tiananmen Square, for example, or Taiwan's autonomy. Many Chinese AI systems, including other reasoning models, decline to respond to topics that might raise the ire of regulators in the country, such as speculation about the Xi Jinping regime.

R1 arrives days after the outgoing Biden administration proposed harsher export rules and restrictions on AI technologies for Chinese ventures. Companies in China were already prevented from buying advanced AI chips, but if the new rules go into effect as written, companies will face stricter caps on both the semiconductor technology and the models needed to bootstrap sophisticated AI systems. In a policy document last week, OpenAI urged the U.S. government to support the development of U.S. AI, lest Chinese models match or surpass them in capability. In an interview with The Information, OpenAI's VP of policy, Chris Lehane, singled out High Flyer Capital Management, DeepSeek's corporate parent, as an organization of particular concern.

So far, at least three Chinese labs (DeepSeek, Alibaba, and Kimi, which is owned by Chinese unicorn Moonshot AI) have produced models that they claim rival o1. Of note, DeepSeek was the first; it announced a preview of R1 in late November. In a post on X, Dean Ball, an AI researcher at George Mason University, said the trend suggests Chinese AI labs will continue to be "fast followers." "The impressive performance of DeepSeek's distilled models [...] means that very capable reasoners will continue to proliferate widely and be runnable on local hardware," Ball wrote, "far from the eyes of any top-down control regime."

This story was originally published on January 20 and updated on January 27 with more information.
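For readers who want to try one of the smaller distilled checkpoints locally, a minimal sketch using Hugging Face transformers follows. The repository id is our assumption about the published checkpoint name, and the prompt is illustrative; check DeepSeek's organization on Hugging Face for the exact model names and license terms.

```python
# Minimal sketch: load a small distilled R1 checkpoint with transformers and
# generate a reply. The repo id below is an assumed name for the 1.5B distilled
# model; verify it against DeepSeek's Hugging Face organization before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # small enough for a laptop

prompt = "How many prime numbers are there between 10 and 30?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```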
