
Latest news with #MMLU

The methodology to judge AI needs realignment

Hindustan Times | 03-06-2025

When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said the models set 'new standards for coding, advanced reasoning, and AI agents', citing leading scores on SWE-bench Verified, a benchmark for performance on real software engineering tasks. OpenAI likewise claims its o3 and o4-mini models return the best scores on certain benchmarks, as does Mistral for its open-source Devstral coding model. AI companies flexing comparative test scores is a common theme.

The world of technology has long obsessed over synthetic benchmark scores. Processor performance, memory bandwidth, storage speed, graphics performance — plentiful examples, often used to judge whether a PC or a smartphone was worth your time and money. Yet experts believe it may be time to evolve the methodology for AI testing rather than change it wholesale.

American venture capitalist Mary Meeker, in the latest AI Trends report, notes that AI is increasingly doing better than humans in terms of accuracy and realism. She points to the MMLU (Massive Multitask Language Understanding) benchmark, on which AI models average 92.30% accuracy against a human baseline of 89.8%. MMLU judges a model's general knowledge across 57 tasks covering professional and academic subjects, including math, law, medicine and history.

Benchmarks serve as standardised yardsticks to measure, compare, and understand the evolution of different AI models: structured assessments that provide comparable scores for different models. They typically consist of datasets containing thousands of curated questions, problems, or tasks that test particular aspects of intelligence.

Understanding benchmark scores requires context about both the scale and the meaning behind the numbers. Most benchmarks report accuracy as a percentage, but the significance of these percentages varies dramatically across tests. On MMLU, random guessing would yield approximately 25% accuracy, since the questions are multiple choice with four options. Human performance typically ranges from 85-95%, depending on the subject area.

Headline numbers often mask important nuances. A model might excel in certain subjects more than others, and an aggregated score may hide weaker performance on tasks requiring multi-step reasoning or creative problem-solving behind strong performance on factual recall. AI engineer and commentator Rohan Paul notes on X that 'most benchmarks don't reward long-term memory, rather they focus on short-context tasks.' Increasingly, AI companies are looking closely at the 'memory' aspect: researchers at Google, in a new paper, detail an attention technique dubbed 'Infini-attention' that lets AI models extend their 'context window'.

Mathematical benchmarks often show wider performance gaps. Most of the latest AI models score over 90% accuracy on the GSM8K benchmark (Claude 3.5 Sonnet leads with 97.72%, while GPT-4 scores 94.8%), but the more challenging MATH benchmark sees much lower scores in comparison: Google's Gemini 2.0 Flash Experimental leads with 89.7%, while GPT-4 scores 84.3% (Sonnet hasn't been tested yet).

Reworking the methodology

For AI testing, there is a need to realign testbeds. 'All the evals are saturated. It's becoming slightly meaningless,' said Satya Nadella, chairman and chief executive officer (CEO) of Microsoft, speaking at venture capital firm Madrona's annual meeting earlier this year.
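To make the scoring mechanics described above concrete: a multiple-choice benchmark like MMLU is scored as plain accuracy, the fraction of questions answered correctly. A minimal sketch in Python, assuming a hypothetical ask_model function standing in for a real model API:

```python
import random

CHOICES = ["A", "B", "C", "D"]  # MMLU questions are four-option multiple choice

def ask_model(question: str, options: list[str]) -> str:
    """Hypothetical stand-in: a real harness would call an LLM API here."""
    return random.choice(CHOICES)  # pure guessing, so expect roughly 25% accuracy

def accuracy(dataset: list[dict]) -> float:
    """Benchmark score = fraction of questions answered correctly."""
    correct = sum(
        ask_model(item["question"], item["options"]) == item["answer"]
        for item in dataset
    )
    return correct / len(dataset)

# Toy items in the expected schema; the real MMLU set spans 57 subjects.
dataset = [
    {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "options": ["Paris", "Rome", "Oslo", "Bern"], "answer": "A"},
]
print(f"accuracy: {accuracy(dataset):.0%}")
```

Saturation then follows naturally: once every frontier model answers nearly all such questions correctly, the metric stops discriminating between them.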
The tech giant has announced it is collaborating with institutions including Penn State University, Carnegie Mellon University and Duke University to develop an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do. The attempt is to build benchmarking agents for dynamic evaluation of models, contextual predictability, human-centric comparisons and the cultural aspects of generative AI. 'The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities,' explains Lexin Zhou, a research assistant at Microsoft.

Currently, popular benchmarks include SWE-bench (Software Engineering Benchmark) Verified to evaluate AI coding skills, ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) to judge generalisation and reasoning, and LiveBench AI, which measures agentic coding tasks and evaluates LLMs on reasoning, coding and math.

Among the limitations that can affect interpretation: many benchmarks can be 'gamed' through techniques that improve scores without necessarily improving intelligence or capability. Case in point, Meta's new Llama models. In April, the company announced an array of models, including Llama 4 Scout, Llama 4 Maverick, and the still-in-training Llama 4 Behemoth. Meta CEO Mark Zuckerberg claims Behemoth will be the 'highest performing base model in the world'. Maverick began ranking above OpenAI's GPT-4o on the LMArena benchmark, and just below Gemini 2.5 Pro. That is where things went pear-shaped for Meta, as AI researchers began to dig through the scores. It turned out Meta had submitted a Llama 4 Maverick model optimised for this test, not exactly the spec customers would get. Meta denies any customisation. 'We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilise implementations,' says Ahmad Al-Dahle, VP of generative AI at Meta, in a statement.

There are other challenges. Models might memorise patterns specific to benchmark formats rather than developing genuine understanding. The selection and design of benchmarks also introduces bias, and there is the question of localisation. Yi Tay, an AI researcher at Google AI and DeepMind, has detailed one such region-specific benchmark, SG-Eval, focused on helping train AI models for wider context. India, too, is building a sovereign large language model (LLM), with Bengaluru-based AI startup Sarvam selected under the IndiaAI Mission.

As AI capabilities continue advancing, researchers are developing evaluation methods that test for genuine understanding, robustness across contexts, and real-world capability, rather than plain pattern matching. In the case of AI, the numbers tell an important part of the story, but not the complete story.
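To make the ADeLe approach described above concrete: tasks are annotated with demand levels on a set of ability scales, models are profiled on the same scales, and performance on an unseen task can then be predicted by comparing the two profiles. A toy sketch of that logic, with invented scale names and levels (the real framework uses 18 cognitive and knowledge-based scales and a more sophisticated predictor than a hard threshold):

```python
# Invented demand/ability profiles for illustration only.
TASK_DEMANDS = {"abstraction": 3, "math_knowledge": 4, "long_context": 2}
MODEL_ABILITIES = {"abstraction": 4, "math_knowledge": 3, "long_context": 5}

def predict_success(demands: dict[str, int], abilities: dict[str, int]) -> bool:
    """Predict the model solves the task if it meets every demand level."""
    return all(abilities.get(scale, 0) >= level for scale, level in demands.items())

# math_knowledge demand (4) exceeds the model's ability (3), so predict failure.
print(predict_success(TASK_DEMANDS, MODEL_ABILITIES))  # False
```

The appeal is that a demand profile, unlike a single aggregate score, can say why a model is expected to fail a particular task.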

The 2025 Stanford AI Index: 5 Takeaways That Are Important For Your Business

Forbes | 01-05-2025

Last month the Stanford Institute for Human-Centered AI, an interdisciplinary institute established in 2019 to advance AI research, education, policy, and practice, published its 2025 AI Index Report, which aims to "develop a more thorough and nuanced understanding of the complex field of AI." The report had a number of important takeaways that impact all of us running businesses. Here are the five biggest takeaways.

The report says: In 2022, the smallest model registering a score higher than 60% on the Massive Multitask Language Understanding (MMLU) benchmark was PaLM, with 540 billion parameters. By 2024, Microsoft's Phi-3-mini, with just 3.8 billion parameters, achieved the same threshold. This represents a roughly 142-fold reduction (540 ÷ 3.8 ≈ 142) in two years. Depending on the task, LLM inference prices have fallen anywhere from 9 to 900 times per year since 2022.

As I wrote previously here, it's becoming more affordable for businesses of all sizes to build their own AI solutions using company data from many different sources. The biggest obstacle remains the price of a developer or IT person sufficiently versed in these tools to do the work.

The report says: The number of AI-related incidents rose to 233 in 2024, a record high and a 56.4% increase over 2023. Among the incidents reported were deepfake intimate images and chatbots allegedly implicated in a teenager's suicide.

While this count isn't comprehensive, it does show a staggering increase in issues. AI is being used for bad stuff. As business owners we have to let others worry about those terrifying Terminator-type risks that could destroy human civilization, and instead focus on getting training and tools to combat the growing use of AI to fool our employees into downloading malware, opening up our systems for data breaches, or inadvertently transferring money out of our accounts.

The report says: AI agents show early promise. In short time-horizon settings (two hours), top AI systems score four times higher than human experts, but when given more time to do a task, humans perform better than AI, outscoring it 2-to-1 at 32 hours. Still, AI agents already match human expertise in select tasks, such as writing specific types of code, while delivering results faster.

AI agents are rolling out this year, but few are worth using in your business due to their immaturity. That's for now. In the meantime we should still be testing them and getting familiar with their capabilities. Within the next few years, agents will be commonplace, performing a great deal of the work our employees are currently doing. Some business owners will (rightly) see this as an opportunity to reduce staff. But smarter leaders understand that their job is to prepare their best people to leverage these tools to be even more productive.

The U.S. widened its commanding lead in global AI investment. U.S. private AI investment hit $109 billion in 2024, nearly 12 times higher than China's $9.3 billion and 24 times the UK's $4.5 billion.

Businesses are turning to AI. In 2024, the proportion of survey respondents reporting AI use by their organizations jumped to 78% from 55% in 2023. Similarly, the number of respondents who reported using generative AI in at least one business function more than doubled, from 33% in 2023 to 71% last year. AI is real.
Right now it's a big-corporation game, with larger brands sinking hundreds of millions of dollars into agentic and generative AI systems that do everything from writing software code to autonomously handling customer service requests. Ultimately these capabilities will pass down to the small and mid-sized companies that opt to wait for their core software vendors to introduce AI features into their business processes.

U.S. states are leading the way on AI legislation amid slow progress at the federal level. In 2016, only one state-level AI-related law was passed; that figure increased to 49 by 2023. In the past year alone, the number more than doubled to 131. While proposed AI bills at the federal level have also increased, the number passed remains low.

I'm not expecting significant regulations at the federal level in the short term. Most will be at the state level and focused on two things: misusing AI in the hiring process (it can be biased) or duping customers with questionable AI bots. Of course, many of these regulations will be decided on by regulators who may be challenged turning on their own TV sets, let alone understanding the implications of AI. But regardless, it will be important to monitor these rules in your state to make sure you're compliant.

The above five trends are important for business owners and managers to keep in mind as they're considering AI and other technology investments. And as interesting as they are now, won't it be fascinating to see what their 2030 report has to say?

While the US and China compete for AI dominance, Russia's leading model lags behind

Yahoo | 06-03-2025

Russia has touted its leading LLM, GigaChat MAX, as part of a national AI strategy. But the model is "unremarkable" and lags behind US and Chinese offerings, AI experts told Business Insider. While the war in Ukraine has stunted development, Moscow may still be developing military AI.

Russian President Vladimir Putin wants his country to compete in the global race to build AI, besting models coming out of China and the US. But its flagship large language model, or LLM, isn't even the best at speaking Russian.

On the Russian-language version of LLM Arena, where users go to compare and rank the answers of different LLMs, GigaChat MAX sits joint eighth at the time of writing, behind various versions of Claude, DeepSeek, and ChatGPT. YandexGPT 4 Pro, an LLM developed by the Russian search engine Yandex, is even lower, at joint 18th. On the English-language version, neither appears in the ranking of more than 170 LLMs.

GigaChat MAX was developed by Russia's state-majority-owned Sberbank. When its latest iteration launched in November, its Moscow-based lead developer, Evgeny Kosarev, said on LinkedIn that it was "close to GPT4o in quality on Russian and English." But experts told Business Insider that, despite Putin emphasizing AI development as a crucial avenue for Russian foreign policy, GigaChat MAX is months behind its American and Chinese competitors, and the country's war against Ukraine has drained it of expertise. Spokespeople for GigaChat MAX and Yandex did not respond to Business Insider's request for comment.

For now, GigaChat MAX, Russia's most developed LLM, is "unremarkable," Lukasz Olejnik, a visiting senior research fellow in cybersecurity at the War Studies department at King's College London, told BI. On "benchmarks" — standardized tests for AI effectiveness — the models' scores "are much lower," he said, adding that they don't surpass any of the cutting-edge, or "frontier," models and don't involve any particular innovation.

Ben Dubow, a senior fellow at the Center for European Policy Analysis and CTO of the data-analysis firm Omelas, added that GigaChat MAX lacks an edge in many ways. While it handles math well, in the Russian language it is far behind most leading Western and Chinese LLMs on some benchmarks, Dubow wrote in The Moscow Times in January. He said that leading LLMs developed in the US were a year ahead of GigaChat MAX's current level on the industry-standard Massive Multitask Language Understanding, or MMLU, benchmark, which tests an LLM's general knowledge and problem-solving ability in text-based answers across a huge range of subjects. Dubow also told BI that most AIs are now being held to more advanced benchmarks, with MMLU "almost considered passé at this point."

"Besting American and Chinese models on Russian language prompts is a top priority for the Russian government's AI strategy, but MAX has not achieved that," Dubow said.

Putin has repeatedly emphasized the importance of AI, including at a December conference where he touted GigaChat MAX and said Russia was ready to assist other nations with developing AI. Samuel Bendett, a specialist in Russian military technology at the Center for Strategic and International Studies, told BI that AI was "a status thing" for Russia. But per a global AI ranking produced by UK media startup Tortoise Media, Russia is the only one of the five "great power" countries — the US, China, France, the UK, and Russia — not at the top of the list. Russia is ranked 31st.
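For context on how arena-style leaderboards such as LLM Arena turn user votes into a ranking: users compare two anonymised answers, and each vote typically feeds an Elo-style rating update. A toy sketch of that rule (illustrative only, not the site's actual implementation):

```python
# Toy Elo-style update of the kind arena leaderboards typically use to turn
# pairwise human votes into a ranking.

K = 32  # step size; a common Elo choice

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Nudge both ratings toward the observed outcome of one 'battle'."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1.0 - s_a) - (1.0 - e_a))

# One vote where the user preferred model A's answer:
r_a, r_b = update(1500.0, 1500.0, a_won=True)
print(round(r_a), round(r_b))  # 1516 1484
```

Because ratings move with head-to-head preferences rather than a fixed test set, positions such as GigaChat MAX's joint eighth can shift as new votes come in.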
Bendett named several factors holding Moscow's AI sector back. Russia's private sector is too small to foster real competition, with almost everything government-supported, he said. Although Sberbank is increasingly casting itself as a technology company, "there is no equivalent to OpenAI and Microsoft or Google or Huawei or Alibaba," he continued. Additionally, Russia's invasion of Ukraine has isolated it from global expertise and collaboration, as well as from access to technology like the microchips necessary to train and run complex AI models efficiently.

"The story of the Russian AI industry is, in a lot of ways, Putin's expansionism undermining Russia's global standing," said Dubow. 2014 — when Russia annexed Crimea — was a transformative year for AI in the West and China. Meanwhile, 2022, the year Russia launched its full-scale invasion of Ukraine, was the year ChatGPT launched, sparking the generative AI boom.

The war in Ukraine accelerated a major brain drain from Russia, according to Dubow. Bendett added that Russia lacks "hundreds of thousands" of high-tech researchers, although he said he believed many of the "tech refugees" who left Russia to avoid the draft have started to trickle back.

Putin acknowledged the problems last year, blaming "unfriendly countries" for the roadblocks and vowing to increase the number of people graduating in AI technology to more than 15,000 a year by 2030, Russia's TASS news agency reported, citing government documents. The report said just 3,000 graduated in 2022. By comparison, the US had more than 73,000 graduates in AI-related fields in 2023, the majority of them international talent, according to the Center for Security and Emerging Technology.

Serhii Kupriienko, CEO of Swarmer, a Ukrainian startup specializing in AI-based systems, told BI that over the next decade, US and Chinese LLMs will help those countries scale their economies "exponentially" by boosting productivity across various sectors, creating jobs in AI, and speeding up innovation. Meanwhile, Russia's struggles with AI mean its likeliest path forward is to "be subordinate to China and rely on what China's producing," Dubow said.

The Kremlin's repeated public statements on AI and the ongoing war in Ukraine have led some analysts to conclude that Russia may be secretly developing a dual-use LLM with military applications. In 2022, a Russian official announced the creation of a department for developing AI within the defense ministry. "Russia envisions AI as a transformative tool for its military," Saratoga Foundation military analysts Timothy Thomas and Glen Howard wrote in a February review of Russian writings on military AI.

Vitaliy Goncharuk, who chaired Ukraine's AI Committee between 2019 and 2022, believes Russia may be training its AI on the vast amounts of battlefield data being generated in Ukraine. Telegram posts and channels, drone footage, satellite imagery, sound sensors, civilian reports, and hacked material from Ukraine's Delta cloud-based management system, which feeds Ukrainian commanders battlefield data, all provide ample material, Goncharuk said. AI developed on this data would not only help Russia improve its precision in identifying targets but also help it plan decision-making and real-time front-line operations, Goncharuk said. It could even predict Ukraine's future decisions and battlefield operations, he added.
Ukraine, too, has gathered vast quantities of battlefield data from three years of war — something that is "truly the holy grail of training your AI models and systems on battlefield target recognition and selection," Bendett told BI. It would be difficult to imagine Russia not quietly using this data as well, he added. "They constantly hint at that," he said.

Read the original article on Business Insider
