Latest news with #ChatbotArena

AI benchmarking platform is helping top companies rig their model performances, study claims

Yahoo

23-05-2025

The go-to benchmark for artificial intelligence (AI) chatbots is facing scrutiny from researchers who claim that its tests favor proprietary AI models from big tech companies. LM Arena effectively places two unidentified large language models (LLMs) in a battle to see which can best tackle a prompt, with users of the benchmark voting for the output they like most. The results are then fed into a leaderboard that tracks which models perform the best and how they have improved. However, researchers have claimed that the benchmark is skewed, granting major LLMs "undisclosed private testing practices" that give them an advantage over open-source LLMs. The researchers published their findings April 29 on the preprint database arXiv, so the study has not yet been peer-reviewed.

"We show that coordination among a handful of providers and preferential policies from Chatbot Arena [later LM Arena] towards the same small group have jeopardized scientific integrity and reliable Arena rankings," the researchers wrote in the study. "As a community, we must demand better."

Beginning as Chatbot Arena, a research project created in 2023 by researchers at the University of California, Berkeley's Sky Computing Lab, LM Arena quickly became a popular site for top AI companies and open-source underdogs to test their models. Favoring "vibes-based" analysis drawn from user responses over academic benchmarks, the site now gets more than 1 million visitors a month.

To assess the impartiality of the site, the researchers analyzed more than 2.8 million battles held over a five-month period. Their analysis suggests that a handful of preferred providers — the flagship models of companies including Meta, OpenAI, Google and Amazon — had "been granted disproportionate access to data and testing" as their models appeared in a higher number of battles, giving their final versions a significant advantage.

"Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively," the researchers wrote. "In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data."

In addition, the researchers noted that proprietary LLMs are tested in LM Arena multiple times before their official release. These models therefore have more access to the arena's data, meaning that when they are finally pitted against other LLMs they can handily beat them, with only the best-performing iteration of each LLM placed on the public leaderboard, the researchers claimed.

"At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives," the researchers wrote in the study. "Both these policies lead to large data access asymmetries over time."

In effect, the researchers argue that being able to test multiple pre-release LLMs, to retract benchmark scores, and to have only the highest-performing iteration of an LLM placed on the leaderboard, combined with certain commercial models appearing in the arena more often than others, gives big AI companies the ability to "overfit" their models. This potentially boosts their arena performance over competitors, but it does not necessarily mean their models are of better quality.
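The battle-and-vote mechanism described above can be illustrated with a small rating sketch. The snippet below is a hypothetical, Elo-style tally in Python; it is not LM Arena's actual scoring code, and the model names, starting rating, and K-factor are illustrative assumptions.

```python
# Minimal, hypothetical sketch of turning pairwise "battle" votes into a
# leaderboard with Elo-style updates. This is NOT LM Arena's implementation;
# model names, the starting rating, and the K-factor are assumptions.
from collections import defaultdict

K = 32                                  # assumed update step size
ratings = defaultdict(lambda: 1000.0)   # every model starts at 1000

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_battle(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after a user votes for 'a' or 'b'."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    s_a = 1.0 if winner == "a" else 0.0
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))

# Illustrative battles with made-up model names and outcomes:
record_battle("model-x", "model-y", "a")
record_battle("model-x", "model-z", "b")

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score:.1f}")
```

A production system would differ in its update rule, tie handling, and confidence intervals, but the core idea of converting pairwise votes into a ranked leaderboard is the same.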
The research has called into question the authority of LM Arena as an AI benchmark. LM Arena has yet to provide an official comment to Live Science, only offering background information in an email response. But the organization did post a response to the research on the social platform X.

"Regarding the statement that some model providers are not treated fairly: this is not true. Given our capacity, we have always tried to honor all the evaluation requests we have received," company representatives wrote in the post. "If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly. Every model provider makes different choices about how to use and value human preferences."

LM Arena also claimed that there were errors in the researchers' data and methodology, responding that LLM developers don't get to choose the best score to disclose, and that only the score achieved by a released LLM is put on the public leaderboard.

Nonetheless, the findings raise questions about how LLMs can be tested in a fair and consistent manner, particularly as passing the Turing test isn't the AI watermark it arguably once was, and scientists are looking at better ways to truly assess the rapidly growing capabilities of AI.

Chatbot Arena secures $100m to enhance AI platform

Yahoo

22-05-2025

Chatbot Arena, a platform designed to compare the performance of various AI models, has raised $100m in seed funding. The round was led by Andreessen Horowitz (a16z) and UC Investments (University of California), with participation from Lightspeed, Laude Ventures, Felicis, Kleiner Perkins, and The House Fund. The round values the company at $600m, according to a Bloomberg report.

The funding coincides with the upcoming relaunch of LMArena, featuring a fully rebuilt platform designed to enhance AI evaluation with greater rigor, transparency, and user focus. The platform, which originated as an academic project at UC Berkeley, enables researchers, developers, and users to assess how AI models perform in real-world scenarios. More than 400 model evaluations have been conducted on LMArena, with more than three million votes cast, influencing both proprietary and open-source models from companies such as Google, OpenAI, Meta, and xAI, the company said.

LMArena co-founder and CEO Anastasios Angelopoulos said: "In a world racing to build ever-bigger models, the hard question is no longer what can AI do. Rather, it's how well can it do it for specific use cases, and for whom. We're building the infrastructure to answer these critical questions."

The relaunched LMArena, set to debut in late May 2025, incorporates community feedback and introduces a rebuilt user interface, mobile-first design, lower latency, and new features like saved chat history and endless chat.

Ion Stoica, co-founder and UC Berkeley professor, said: "AI evaluation has often lagged behind model development. LMArena closes that gap by putting rigorous, community-driven science at the centre. It's refreshing to be part of a team that leads with long-term integrity in a space moving this fast."

The company collaborates with model providers to identify performance trends, gather human preference data, and test updates in real-world conditions, aiming to develop advanced analytics and enterprise services while keeping core participation free.

"We invested in LMArena because the future of AI depends on reliability," said Anjney Midha, general partner at a16z. "And reliability requires transparent, scientific, community-led evaluation. LMArena is building that backbone."

Jagdeep Singh Bachher, chief investment officer at UC Investments, said: "We're excited to see open AI research translated into real-world impact through platforms like LMArena. Supporting innovation from university labs such as those at UC Berkeley is essential for building technologies that responsibly serve the public and advance the field."

"Chatbot Arena secures $100m to enhance AI platform" was originally created and published by Verdict, a GlobalData-owned brand.

All About Google's Latest Gemma AI Model That Can Run On Smartphones

India.com

22-05-2025

New Delhi: Google introduced Gemma 3n at its I/O event on Tuesday. It is a new AI model designed to run smoothly on everyday devices like phones, laptops and tablets. The model is now available in preview and can understand and process audio, text, images and even videos.

Gemini usually needs an internet connection, as most of its tasks, especially the more complex ones, are processed in the cloud. For those who prefer on-device AI, there's Gemini Nano, which is made to handle tasks directly on smartphones. And now there's another option: Gemma 3n. Announced at Google I/O 2025, Gemma 3n is built on the same technology as Gemini Nano but is open-source and designed to run smoothly on phones, laptops, and tablets without relying on the cloud.

The new system has been created in partnership with Qualcomm, MediaTek and Samsung. It has been designed to deliver fast, efficient and private AI experiences right on your device. The Gemma 3n model, built on this tech, runs smoothly with just 2GB to 3GB of RAM and is surprisingly quick. It performs on par with top AI models like Anthropic's Claude 3.7 Sonnet, based on recent Chatbot Arena rankings.

What can Gemma 3n do?

Gemma 3n is a multimodal AI model, which means it can understand text, voice, and even images — whether they're on your screen or coming from your phone's camera in real time. It can read text, translate languages, solve math problems, and answer complex questions on the spot. While the experience feels similar to what Gemini and Gemini Live offer on mobile, the key difference is that Gemma 3n is built to run directly on your device. It's not a standalone Google app, but a model that developers can integrate into apps or operating systems.

Google says Gemma 3n is very efficient, running smoothly with only 2 to 3GB of RAM. It's also faster than many other AI models, both proprietary and open-source. In fact, its performance ranks close to Anthropic's Claude 3.7 Sonnet, according to Chatbot Arena scores.
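As an illustration of what running an open, on-device model can look like for developers, here is a hedged Python sketch using the Hugging Face transformers pipeline API. The checkpoint name below is an assumption (check the model hub for the actual Gemma 3n identifiers), the pipeline task may differ for multimodal use, and real memory use and speed depend on quantization and the runtime.

```python
# Hypothetical sketch of running a small open Gemma-class model locally with
# the Hugging Face `transformers` library. The checkpoint name is an
# assumption, not a confirmed Gemma 3n identifier; verify on the model hub.
from transformers import pipeline

MODEL_ID = "google/gemma-3n-e2b-it"  # assumed identifier, verify before use

# Text-generation is used here for simplicity; multimodal (audio/image/video)
# inputs would need a different pipeline task or runtime.
generator = pipeline("text-generation", model=MODEL_ID)

result = generator(
    "Translate to French: Where is the train station?",
    max_new_tokens=40,
)
print(result[0]["generated_text"])
```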

Study accuses LM Arena of helping top AI labs game its benchmark

Yahoo

01-05-2025

A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.

According to the authors, LM Arena allowed some industry-leading AI companies like Meta, OpenAI, Google, and Amazon to privately test several variants of AI models, then not publish the scores of the lowest performers. This made it easier for these companies to achieve a top spot on the platform's leaderboard, though the opportunity was not afforded to every firm, the authors say.

"Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others," said Cohere's VP of AI research and co-author of the study, Sara Hooker, in an interview with TechCrunch. "This is gamification."

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by putting answers from two different AI models side-by-side in a "battle," and asking users to choose the best one. It's not uncommon to see unreleased models competing in the arena under a pseudonym. Votes over time contribute to a model's score — and, consequently, its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one.

However, that's not what the paper's authors say they uncovered. One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March leading up to the tech giant's Llama 4 release, the authors allege. At launch, Meta only publicly revealed the score of a single model — a model that happened to rank near the top of the Chatbot Arena leaderboard.

In an email to TechCrunch, LM Arena Co-Founder and UC Berkeley Professor Ion Stoica said that the study was full of "inaccuracies" and "questionable analysis."

"We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference," said LM Arena in a statement provided to TechCrunch. "If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly."

Armand Joulin, a principal researcher at Google DeepMind, also noted in a post on X that some of the study's numbers were inaccurate, claiming Google only sent one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin on X, promising the authors would make a correction.

The paper's authors started conducting their research in November 2024 after learning that some AI companies were possibly being given preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month stretch.

The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of model "battles." This increased sampling rate gave these companies an unfair advantage, the authors allege. Using additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%.
However, LM Arena said in a post on X that Arena Hard performance does not directly correlate to Chatbot Arena performance. Hooker said it's unclear how certain AI companies might've received priority access, but that it's incumbent on LM Arena to increase its transparency regardless.

In a post on X, LM Arena said that several of the claims in the paper don't reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.

One important limitation of the study is that it relied on "self-identification" to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin, and relied on the models' answers to classify them — a method that isn't foolproof. However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn't dispute them.

TechCrunch reached out to Meta, Google, OpenAI, and Amazon — all of which were mentioned in the study — for comment. None immediately responded.

In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more "fair." For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose scores from these tests.

In a post on X, LM Arena rejected these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it "makes no sense to show scores for pre-release models which are not publicly available," because the AI community cannot test the models for themselves.

The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been receptive to this recommendation publicly, and indicated that it'll create a new sampling algorithm.

The paper comes weeks after Meta was caught gaming benchmarks in Chatbot Arena around the launch of its above-mentioned Llama 4 models. Meta optimized one of the Llama 4 models for "conversationality," which helped it achieve an impressive score on Chatbot Arena's leaderboard. But the company never released the optimized model — and the vanilla version ended up performing much worse on Chatbot Arena. At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.

Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study increases scrutiny on private benchmark organizations — and whether they can be trusted to assess AI models without corporate influence clouding the process.

This article originally appeared on TechCrunch.
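The best-of-N concern at the heart of the paper, testing many private variants and publishing only the strongest, can be illustrated with a short simulation. The sketch below assumes every variant shares the same true quality and that measured arena scores are just noisy samples around it; the distribution parameters are illustrative assumptions, and this is not a reproduction of the study's analysis.

```python
# Hedged sketch of the "best-of-N" selection effect the paper describes:
# if N private variants all have the same true quality but their measured
# arena scores are noisy, reporting only the maximum inflates the result.
# The score distribution and noise level are illustrative assumptions;
# N=27 mirrors the Meta variant count alleged in the study.
import random
import statistics

random.seed(0)
TRUE_SCORE = 1200.0   # assumed underlying rating shared by all variants
NOISE_SD = 15.0       # assumed measurement noise per variant
N_VARIANTS = 27
TRIALS = 10_000

best_scores = []
for _ in range(TRIALS):
    measured = [random.gauss(TRUE_SCORE, NOISE_SD) for _ in range(N_VARIANTS)]
    best_scores.append(max(measured))

print(f"true score:          {TRUE_SCORE:.1f}")
print(f"mean of best-of-{N_VARIANTS}:  {statistics.mean(best_scores):.1f}")
# The gap between the two numbers is the inflation from publishing only
# the best-performing private variant.
```

With these assumed parameters the mean best-of-27 score lands well above the shared true score, which is the kind of inflation the authors argue the private-testing policy permits.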
