Latest news with #MATH-500
Yahoo | Business | 04-04-2025
Stop chasing AI benchmarks—create your own
Every few months, a new large language model (LLM) is anointed AI champion, with record-breaking benchmark scores. But these celebrated metrics of LLM performance—such as tests of graduate-level reasoning and abstract math—rarely reflect real business needs or represent truly novel AI frontiers. For companies in the market for enterprise AI models, basing the decision of which models to use on these leaderboards alone can lead to costly mistakes—from wasted budgets to misaligned capabilities and potentially harmful, domain-specific errors that benchmark scores rarely capture.

Public benchmarks can be helpful to individual users by providing directional indicators of AI capabilities. And admittedly, some code-completion and software-engineering benchmarks, like SWE-bench or Codeforces, are valuable for companies within a narrow range of coding-related, LLM-based business applications. But the most common benchmarks and public leaderboards often distract both businesses and model developers, pushing innovation toward marginal improvements in areas that are unhelpful for businesses or unrelated to breakthrough AI innovation.

The challenge for executives, therefore, lies in designing business-specific evaluation frameworks that test potential models in the environments where they'll actually be deployed. To do that, companies will need to adopt tailored evaluation strategies and run them at scale using relevant and realistic data.

The flashy benchmarks that model developers tout in their releases are often detached from the realities of enterprise applications. Consider some of the most popular ones: graduate-level reasoning (GPQA Diamond) and competition-level math tests, like MATH-500 and AIME 2024. Each of these was cited in the release announcements for OpenAI's o1, Anthropic's Claude 3.7 Sonnet, or DeepSeek's R1. But none of these indicators is helpful in assessing common enterprise applications like knowledge-management tools, design assistants, or customer-facing chatbots.

Instead of assuming that the "best" model on a given leaderboard is the obvious choice, businesses should use metrics tailored to their specific needs and work backward to identify the right model. Start by testing models on your actual context and data—real customer queries, domain-specific documents, or whatever inputs your system will encounter in production. When real data is scarce or sensitive, companies can craft synthetic test cases that capture the same challenges. Without real-world tests, companies can end up with ill-fitting models that, for instance, require too much memory for edge devices, have latency that's too high for real-time interactions, or lack support for the on-premises deployment sometimes mandated by data governance standards.

Salesforce has tried to bridge this gap between common benchmarks and its actual business requirements by developing an internal benchmark for its CRM-related needs. The company created its own evaluation criteria specifically for tasks like prospecting, nurturing leads, and generating service case summaries—the actual work that marketing and sales teams need AI to perform.
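As a minimal sketch of what such business-specific testing can look like, the Python snippet below runs a candidate model against a handful of domain-specific test cases, repeats each case several times, and reports an average score with its spread. The call_model stub, the test cases, and the keyword-based scorer are hypothetical placeholders, meant to be swapped for your own integration, data, and success criteria.

# Minimal sketch of a business-specific evaluation harness.
# call_model is a stand-in for however you invoke a candidate model
# (API client, local checkpoint, etc.); replace it with your integration.
import statistics

def call_model(model_name: str, prompt: str) -> str:
    # Placeholder response so the harness runs end to end; wire this to
    # your actual model provider in practice.
    return "Please use the reset link we email to you."

# Real customer queries (or synthetic ones mirroring them), each with
# the facts a correct answer must contain.
TEST_CASES = [
    {"prompt": "A customer asks how to reset their password.",
     "must_mention": ["reset link", "email"]},
    {"prompt": "Summarize our refund policy for orders over 30 days old.",
     "must_mention": ["store credit", "30 days"]},
]

def score(answer: str, must_mention: list[str]) -> float:
    # Fraction of required facts that appear in the answer.
    answer = answer.lower()
    return sum(term.lower() in answer for term in must_mention) / len(must_mention)

def evaluate(model_name: str, runs_per_case: int = 5) -> tuple[float, float]:
    # Average score and spread across repeated runs of every test case,
    # because a single run of a stochastic model tells you little.
    scores = []
    for case in TEST_CASES:
        for _ in range(runs_per_case):
            answer = call_model(model_name, case["prompt"])
            scores.append(score(answer, case["must_mention"]))
    return statistics.mean(scores), statistics.stdev(scores)

mean_score, spread = evaluate("candidate-model-a")

The point is not the toy scoring rule but the shape of the loop: your prompts, your pass criteria, and enough repetitions to see the variance.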
Popular benchmarks are not only insufficient for informed business decision-making but can also be misleading. LLM media coverage, including all three of the release announcements mentioned above, often uses benchmarks to compare models based on their average performance, with each benchmark distilled into a single dot, number, or bar.

The trouble is that generative AI models are stochastic, highly input-sensitive systems, which means that slight variations of a prompt can make them behave unpredictably. A recent research paper from Anthropic rightly argues that, as a result, single dots on a performance comparison chart are insufficient because of the large error ranges of the evaluation metrics. A recent study by Microsoft found that using a statistically more rigorous cluster-based evaluation on the same benchmarks can significantly change the rank ordering of—and the public narratives about—models on a leaderboard.

That's why business leaders need to ensure reliable measurements of model performance across a reasonable range of input variations, done at scale, even if it requires hundreds of test runs. This thoroughness becomes even more critical when multiple systems are combined through AI and data supply chains, potentially compounding variability. For industries like aviation or healthcare, the acceptable margin of error is far smaller than what current AI benchmarks typically guarantee, so relying solely on leaderboard metrics can obscure substantial operational risk in real-world deployments.

Businesses must also test models in adversarial scenarios to ensure security and robustness—such as a chatbot's resistance to manipulation by bad actors attempting to bypass guardrails—properties that conventional benchmarks cannot measure. LLMs are notably vulnerable to being fooled by sophisticated prompting techniques. Depending on the use case, implementing strong safeguards against these vulnerabilities could determine your technology choice and deployment strategy. The resilience of a model in the face of a potential bad actor could be a more important metric than the model's math or reasoning capabilities. In our view, making AI 'foolproof' is an exciting and impactful next barrier for AI researchers to break, one that may require novel model development and testing techniques.

Start with existing evaluation frameworks.

Companies should start by leveraging the strengths of existing automated tools, along with human judgment and practical but repeatable measurement goals. Specialized AI evaluation toolkits, such as DeepEval, LangSmith, TruLens, Mastra, or ARTKIT, can expedite and simplify testing, allowing for consistent comparison across models and over time.

Bring human experts to the testing ground.

Effective AI evaluation requires that automated testing be supplemented with human judgment wherever possible. Automated evaluation could include comparing LLM answers to ground-truth answers, or using proxy metrics, such as ROUGE or BLEU scores, to gauge the quality of text summarization. For nuanced assessments where machines still struggle, however, human evaluation remains vital. This could include domain experts or end users conducting a 'blind' review of a sample of model outputs. Such reviews can also flag potential biases, such as LLM responses about job candidates that are skewed by gender or race. This human layer of review is labor-intensive, but it can provide critical additional insight, like whether a response is actually useful and well-presented.
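As a minimal sketch of this hybrid setup, the snippet below scores every answer against a ground-truth reference with BLEU (via the NLTK library) and exports a random, anonymized sample for blind expert review. The field names, the CSV layout, and the sample data are illustrative assumptions, not a prescribed format.

# Hybrid evaluation: automated proxy scores for every output, plus a
# random anonymized sample routed to domain experts for blind review.
import csv
import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference: str, candidate: str) -> float:
    # Rough automated similarity between a model answer and a ground-truth answer.
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)

def triage(results: list, sample_size: int = 20, path: str = "blind_review.csv"):
    # Score every result automatically.
    for r in results:
        r["bleu"] = bleu(r["ground_truth"], r["model_answer"])
    # Export a sample for human reviewers; they see only the prompt and the
    # answer (no model name, no automated score), so the review stays blind.
    sample = random.sample(results, min(sample_size, len(results)))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "model_answer"])
        writer.writeheader()
        for r in sample:
            writer.writerow({"prompt": r["prompt"], "model_answer": r["model_answer"]})
    return results

# Illustrative record; in practice these come from your evaluation runs.
results = [{"prompt": "How do I reset my password?",
            "ground_truth": "Use the reset link emailed to you.",
            "model_answer": "Click the reset link we send to your email."}]
triage(results, sample_size=1)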
The value of this hybrid approach can be seen in a recent case study in which a company evaluated an HR-support chatbot using both human and automated tests. The company's iterative internal evaluation process, with human involvement, showed that a significant source of LLM response errors was flawed updates to the underlying enterprise data. The discovery highlights how human evaluation can uncover systemic issues beyond the model itself.

Focus on tradeoffs, not isolated dimensions of assessment.

When evaluating models, companies must look beyond accuracy to consider the full spectrum of business requirements: speed, cost efficiency, operational feasibility, flexibility, maintainability, and regulatory compliance. A model that performs marginally better on accuracy metrics might be prohibitively expensive or too slow for real-time applications.

A good example is how OpenAI's o1 (a leader in many benchmarks at release time) performed when applied to the ARC-AGI prize. To the surprise of many, o1 performed poorly, largely because of ARC-AGI's 'efficiency limit' on the computing power used to solve the benchmark tasks: the model would often take too long, spending more compute time in pursuit of a more accurate answer. Most popular benchmarks impose no time limit, even though time is a critically important factor for many business use cases.

Tradeoffs become even more important in the growing world of (multi-)agentic applications, where simpler tasks can be handled by cheaper, quicker models (overseen by an orchestration agent), while the most complex steps—such as solving the decomposed series of problems from a customer request—may require a more powerful reasoning model to succeed. Microsoft Research's HuggingGPT, for example, orchestrates specialized models for different tasks under a central language model.

Being prepared to swap models for different tasks requires building flexible tooling that isn't hard-coded to a single model or provider. This built-in flexibility allows companies to easily pivot and change models based on evaluation results. While this may sound like a lot of extra development work, a number of available tools, like LangChain, LlamaIndex, and Pydantic AI, can simplify the process.
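Even without those libraries, the underlying idea is a thin abstraction layer between application code and any particular model. The sketch below, built on purely hypothetical stand-in model classes, routes simple tasks to a cheap, fast model and complex ones to a stronger reasoning model; swapping providers later only means changing which classes are constructed.

# Provider-agnostic tooling: application code depends on a small interface,
# so models can be swapped based on evaluation results without rewriting
# call sites. The two model classes are hypothetical placeholders, not
# real client code for any vendor.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class CheapFastModel:
    # Stand-in for a small, low-latency model used for routine steps.
    def complete(self, prompt: str) -> str:
        return f"[cheap model] {prompt[:40]}..."

class PowerfulReasoningModel:
    # Stand-in for a slower, costlier model reserved for complex steps.
    def complete(self, prompt: str) -> str:
        return f"[reasoning model] {prompt[:40]}..."

def route(task: str, is_complex: bool, fast: ChatModel, strong: ChatModel) -> str:
    # Send simple tasks to the cheap model and hard ones to the strong model.
    model = strong if is_complex else fast
    return model.complete(task)

# Changing providers later means changing only which classes are constructed here.
answer = route("Summarize this service ticket.", False,
               CheapFastModel(), PowerfulReasoningModel())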
Turn model testing into a culture of continuous evaluation and monitoring.

As the technology evolves, ongoing assessment ensures AI solutions remain optimal and stay aligned with business objectives. Much as software engineering teams implement continuous integration and regression testing to catch bugs and prevent performance degradation in traditional code, AI systems require regular evaluation against business-specific benchmarks. Similar to the practice of pharmacovigilance among users of new medicines, feedback from LLM users and affected stakeholders also needs to be continuously gathered and analyzed to ensure AI 'behaves as expected' and doesn't drift from its intended performance targets. This kind of bespoke evaluation framework fosters a culture of experimentation and data-driven decision-making. It also enforces a new and critical mantra: AI may be used for execution, but humans remain in control and must govern AI.

For business leaders, the path to AI success lies not in chasing the latest benchmark champions but in developing evaluation frameworks suited to your specific business objectives. Think of this approach as 'a leaderboard for every user,' as one Stanford paper suggests.

The true value of AI deployment comes from three key actions: defining metrics that directly measure success in your business context; implementing statistically robust testing on realistic inputs drawn from your actual data and context; and fostering a culture of continuous monitoring, evaluation, and experimentation that draws on both automated tools and human expertise to assess tradeoffs across models. By following this approach, executives will be able to identify solutions optimized for their specific needs without paying premium prices for 'top-notch' models. Doing so can also help steer the model-development industry away from chasing marginal improvements on the same metrics—falling victim to Goodhart's law with capabilities of limited use to business—and free it up to explore new avenues of innovation and the next AI breakthrough.

Read other Fortune columns by François Candelon. François Candelon is a partner at private equity firm Seven2 and the former global director of the BCG Henderson Institute. Theodoros Evgeniou is a professor at INSEAD and a cofounder of the trust and safety company Tremau. Max Struever is a principal engineer at BCG-X and an ambassador at the BCG Henderson Institute. David Zuluaga Martínez is a partner at Boston Consulting Group and an ambassador at the BCG Henderson Institute. Some of the companies mentioned in this column are past or present clients of the authors' employers.
Yahoo | Science | 28-01-2025
DeepSeek claims its 'reasoning' model beats OpenAI's o1 on certain benchmarks
Chinese AI lab DeepSeek has released an open version of DeepSeek-R1, its so-called reasoning model, that it claims performs as well as OpenAI's o1 on certain AI benchmarks. R1 is available from the AI dev platform Hugging Face under an MIT license, meaning it can be used commercially without restrictions.

According to DeepSeek, R1 beats o1 on the benchmarks AIME, MATH-500, and SWE-bench Verified. AIME is a set of challenging competition math problems, while MATH-500 is a collection of math word problems. SWE-bench Verified, meanwhile, focuses on real-world programming tasks.

Being a reasoning model, R1 effectively fact-checks itself, which helps it avoid some of the pitfalls that normally trip up models. Reasoning models take a little longer — usually seconds to minutes longer — to arrive at solutions compared to a typical nonreasoning model. The upside is that they tend to be more reliable in domains such as physics, science, and math.

R1 contains 671 billion parameters, DeepSeek revealed in a technical report. Parameters roughly correspond to a model's problem-solving skills, and models with more parameters generally perform better than those with fewer. While 671 billion parameters is massive, DeepSeek also released "distilled" versions of R1 ranging in size from 1.5 billion to 70 billion parameters; the smallest can run on a laptop. The full R1 requires beefier hardware, but it is available through DeepSeek's API at prices 90%-95% lower than OpenAI's o1.

Clem Delangue, the CEO of Hugging Face, said in a post on X on Monday that developers on the platform have created more than 500 "derivative" models of R1 that have racked up 2.5 million downloads combined — five times the number of downloads the official R1 has gotten.

There is a downside to R1. Being a Chinese model, it's subject to review by China's internet regulator to ensure that its responses "embody core socialist values." R1 won't answer questions about Tiananmen Square, for example, or Taiwan's autonomy. Many Chinese AI systems, including other reasoning models, decline to respond to topics that might raise the ire of regulators in the country, such as speculation about the Xi Jinping regime.

R1 arrives days after the outgoing Biden administration proposed harsher export rules and restrictions on AI technologies for Chinese ventures. Companies in China were already prevented from buying advanced AI chips, but if the new rules go into effect as written, companies will face stricter caps on both the semiconductor technology and the models needed to bootstrap sophisticated AI systems.

In a policy document last week, OpenAI urged the U.S. government to support the development of U.S. AI, lest Chinese models match or surpass it in capability. In an interview with The Information, OpenAI's VP of policy Chris Lehane singled out High-Flyer Capital Management, DeepSeek's corporate parent, as an organization of particular concern.

So far, at least three Chinese labs — DeepSeek, Alibaba, and Moonshot AI, the Chinese unicorn behind the Kimi assistant — have produced models that they claim rival o1. (Of note, DeepSeek was the first; it announced a preview of R1 in late November.) In a post on X, Dean Ball, an AI researcher at George Mason University, said the trend suggests Chinese AI labs will continue to be "fast followers."

"The impressive performance of DeepSeek's distilled models [...] means that very capable reasoners will continue to proliferate widely and be runnable on local hardware," Ball wrote, "far from the eyes of any top-down control regime."
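For a sense of what "runnable on local hardware" means in practice, the sketch below loads one of the small distilled checkpoints with the Hugging Face transformers library. The model ID is assumed to be the 1.5-billion-parameter distilled variant published on Hugging Face and should be checked against the actual repository; the prompt and generation settings are arbitrary.

# Illustrative sketch: running a small distilled R1 checkpoint locally
# with the transformers library. Adjust the model ID and generation
# settings to whatever checkpoint you actually pull.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))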
This story originally published on January 20 and was updated on January 27 with more information.