AI benchmarking platform is helping top companies rig their models' performance, study claims
The go-to benchmark for artificial intelligence (AI) chatbots is facing scrutiny from researchers who claim that its tests favor proprietary AI models from big tech companies.
LM Arena effectively places two unidentified large language models (LLMs) in a battle to see which can best tackle a prompt, with users of the benchmark voting for the output they like most. The results are then fed into a leaderboard that tracks which models perform the best and how they have improved.
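To make the mechanics concrete, here is a minimal sketch of how pairwise votes can be turned into leaderboard ratings using an Elo-style update, the general approach behind arena-style leaderboards. LM Arena's actual scoring pipeline (which has used Elo- and Bradley-Terry-style rating models) is more sophisticated; the model names, K-factor, and battles below are illustrative assumptions.

```python
from collections import defaultdict

K = 32  # update step size; a common Elo default, chosen here for illustration

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(battles, initial=1000.0):
    """battles: iterable of (model_a, model_b, winner) tuples,
    where winner is 'a', 'b', or 'tie'."""
    ratings = defaultdict(lambda: initial)
    for a, b, winner in battles:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Three hypothetical battles between made-up models:
leaderboard = update_ratings([
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
])
print(sorted(leaderboard.items(), key=lambda kv: -kv[1]))
```

Because every rating depends on which battles a model is sampled into, a model's final standing reflects not just its quality but how much arena exposure it received, which is the crux of the researchers' complaint.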
However, researchers have claimed that the benchmark is skewed, granting major LLMs "undisclosed private testing practices" that give them an advantage over open-source LLMs. The researchers published their findings April 29 on the preprint database arXiv, so the study has not yet been peer-reviewed.
"We show that coordination among a handful of providers and preferential policies from Chatbot Arena [later LM Arena] towards the same small group have jeopardized scientific integrity and reliable Arena rankings," the researchers wrote in the study. "As a community, we must demand better."
Beginning as Chatbot Arena, a research project created in 2023 by researchers at the University of California, Berkeley's Sky Computing Lab, LM Arena quickly became a popular site for top AI companies and open-source underdogs to test their models. Favoring "vibes-based" analysis drawn from user responses over academic benchmarks, the site now gets more than 1 million visitors a month.
To assess the impartiality of the site, the researchers analyzed more than 2.8 million battles conducted over a five-month period. Their analysis suggests that a handful of preferred providers — the flagship models of companies including Meta, OpenAI, Google and Amazon — had "been granted disproportionate access to data and testing," as their models appeared in a higher number of battles, giving their final versions a significant advantage.
"Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively," the researchers wrote. "In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data."
In addition, the researchers noted that proprietary LLMs are tested in LM Arena multiple times before their official release. These models therefore have more access to the arena's data, and because only the best-performing iteration of each LLM is placed on the public leaderboard, they can handily beat other LLMs when finally pitted against them, the researchers claimed.
"At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives," the researchers wrote in the study. "Both these policies lead to large data access asymmetries over time."
In effect, the researchers argue that the ability to test multiple pre-release LLMs, to retract benchmark scores, and to have only the highest-performing iteration placed on the leaderboard, together with certain commercial models appearing in the arena more often than others, lets big AI companies "overfit" their models to the arena. This potentially boosts their arena performance over competitors, but it does not necessarily mean their models are of better quality.
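The best-of-N effect alone can inflate a published score. The toy simulation below assumes, hypothetically, that each arena evaluation returns a model's true quality plus zero-mean Gaussian noise; reporting only the maximum over 27 private variants (the count the study attributes to Meta's Llama-4 run-up) biases the published number upward even when every variant is identical in quality. The noise level and rating values are illustrative assumptions, not figures from the study.

```python
import random
import statistics

random.seed(0)  # reproducible toy run

TRUE_QUALITY = 1200.0  # hypothetical "true" rating, identical for all variants
NOISE_SD = 3.0         # assumed evaluation noise; illustrative, not from the study
TRIALS = 10_000

def measured_score() -> float:
    """One arena evaluation: true quality plus zero-mean Gaussian noise."""
    return random.gauss(TRUE_QUALITY, NOISE_SD)

single = [measured_score() for _ in range(TRIALS)]
best_of_27 = [max(measured_score() for _ in range(27)) for _ in range(TRIALS)]

print(f"one submission:   mean {statistics.mean(single):.1f}")
print(f"best of 27 tries: mean {statistics.mean(best_of_27):.1f}")
# The single-submission mean matches the true quality (~1200), while the
# best-of-27 mean lands several points higher -- same model, better-looking score.
```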
RELATED STORIES
— Scientists use AI to encrypt secret messages that are invisible to cybersecurity systems
— What is the Turing test? How the rise of generative AI may have broken the famous imitation game
— US Air Force wants to develop smarter mini-drones powered by brain-inspired AI chips
The research has called into question the authority of LM Arena as an AI benchmark. LM Arena has yet to provide an official comment to Live Science, only offering background information in an email response. But the organization did post a response to the research on the social platform X.
"Regarding the statement that some model providers are not treated fairly: this is not true. Given our capacity, we have always tried to honor all the evaluation requests we have received," company representatives wrote in the post. "If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly. Every model provider makes different choices about how to use and value human preferences."
LM Arena also claimed that there were errors in the researchers' data and methodology, responding that LLM developers don't get to choose the best score to disclose, and that only the score achieved by a released LLM is put on the public leaderboard.
Nonetheless, the findings raise questions about how LLMs can be tested in a fair and consistent manner, particularly as passing the Turing test is no longer the AI milestone it arguably once was, and scientists are looking for better ways to truly assess the rapidly growing capabilities of AI.
Related Articles


Business Wire
The AI Collective Launches Globally to Mobilize Next Generation of AI Innovators and Stewards
SAN FRANCISCO--(BUSINESS WIRE)--The world's largest grassroots AI community today formally detailed its global rebrand as The AI Collective, its official incorporation as a non-profit organization, and the launch of three foundational initiatives to cultivate a collaborative ecosystem for the responsible stewardship of AI. Formerly The GenAI Collective, the organization boasts 70,000+ members across 25+ global chapters and partners including AWS, Meta, Anthropic, Roam, and Product Hunt.

"AI's future depends not just on faster models—but on rebuilding trust, fostering global collaboration, and aligning progress with human values."

This announcement builds on initial excitement sparked by a social media reveal on Monday, June 3rd, and coincides with flagship launch events currently underway in San Francisco, New York City, and across its global chapters.

As the race to develop artificial intelligence accelerates daily, fueled by intense competition and unprecedented investment, the essential work of aligning these powerful systems with human values and societal trust is often dangerously sidelined. The AI Collective is responding directly to this critical gap, positioning itself as the essential human counterpoint in the age of acceleration by building a global community to steward AI, informing decision-makers, and supporting mission-aligned entrepreneurs.

"The ground is shifting beneath us. AI's exponential progress brings immense possibility, but also profound questions about our future, our work, and our very identity – and the institutions we rely on weren't built for this velocity," said Chappy Asel, Co-Founder and Executive Director of The AI Collective. "Trust is the invisible thread holding society together, and right now, that thread is fraying under the strain of misaligned incentives and rapid, uncoordinated change. We believe that in an era of exponential transformation, our greatest strength lies in each other – in creating trusted, in-person spaces to make sense of this moment, ask the deeper questions, and collaboratively shape a future where AI genuinely enhances human flourishing, not just surpasses human capability," Asel said.

The AI Collective believes that without deliberate intervention focused on trust, openness, and participation, AI's trajectory risks leaving humanity behind. Recognizing that managing this requires bridging the widening gap between technological progress and societal adaptation, the organization is launching three foundational initiatives designed to rebuild trust through action and participation:

Expanding Global Community & Events: Rebuilding Trust Through In-Person Connection. The AI Collective will rehumanize innovation by launching 100 chapters across six continents by end of year. Each chapter will fuse its community's unique vision with global shared values of openness and inclusivity, curating salons, hackathons, demo nights, mentorship circles, and cross-disciplinary mashups to deepen trust and collaboration. The network will converge at Singularity Fest in November 2025 in San Francisco for a multi-day, decentralized celebration attracting 10,000 pioneers across domains for hands-on labs, purpose-driven keynotes, thematic tracks, and community-led activations, ensuring AI's next leaps keep humanity firmly at the center.

Informing Decision-Makers: Bridging the Gap Between Frontier Insight and Responsible Governance. To counter the risks of institutional blind spots and lagging policy, The AI Collective Institute, the organization's participatory research arm, connects frontier technologists directly with policymakers, industry leaders, and the public. It translates ground-truth insights from the AI ecosystem into practical guidance through open research, equipping frontline leaders to foster responsible innovation and navigate future shock effectively.

Supporting Mission-Aligned Innovation: Actively Incentivizing Human-Centric AI Development. In pursuit of a values-aligned AI ecosystem driven by the community's unique trust and insight, Collective Investments acts as a dedicated founder-investor matchmaking program under the non-profit umbrella. It identifies and supports promising founders building trustworthy, beneficial AI, connecting them with values-aligned capital allocators (VCs, grants, angels) and providing crucial support to ensure the AI future being built reflects the principles of human flourishing and responsible progress.

The AI Collective is celebrating this evolution with its ongoing flagship events and global chapter celebrations. The organization invites builders, thinkers, policymakers, investors, and pioneers worldwide to join the conversation and contribute to shaping a future where technology serves all of humanity. Read the organization's foundational perspective, "Trust in the Age of Acceleration," learn more and join the community at: and follow the journey @_ai_collective on social platforms.

ABOUT THE AI COLLECTIVE

The AI Collective (formerly The GenAI Collective) is the world's largest non-profit, grassroots community dedicated to empowering the AI ecosystem to collaboratively steer AI's future toward trust, openness, and human flourishing. Founded in 2023, the AI Collective has rapidly grown into a global force:

70,000+ members: comprising leading founders, researchers, investors, and multidisciplinary operators from OpenAI, Anthropic, Nvidia, Google, Microsoft, Amazon, Databricks, Cohere, and more.
200+ AI startups launched or showcased at demo nights, connecting directly with investors and clients; participating companies subsequently raised $72M+ in funding.
25+ active chapters with 200+ events hosted, located in major tech hubs globally including New York City, London, Paris, Washington, D.C., Seattle, Bengaluru and more.
40+ leading partners, including Amazon, Anthropic, Andreessen Horowitz, Meta, Github, TedAI, Product Hunt, Roam, Linux Foundation, and top academic institutions.
A dedicated global team of 100+ volunteer organizers committed to building authentic, impactful community experiences.

Through its focus on in-person connection, participatory research via The AI Collective Institute, and support for mission-aligned innovation via Collective Investments, the organization serves as a vital hub for sense-making, collaboration, and responsible stewardship in the age of artificial intelligence.


The New York Times
On a Search for an Old E.V., Jay Leno's Car Obsession Came Up Clutch
Times Insider explains who we are and what we do and delivers behind-the-scenes insights into how our journalism comes together.

As an energy reporter on the Business desk of The New York Times, I often cover the transition to electrify the world around us, including automobiles and heating and cooling systems. But until I spoke with the historian at the Petersen Automotive Museum in Los Angeles, I did not know that electric cars rattled down city streets as far back as the mid-1890s. A century ago, roughly a third of taxi drivers in New York City shuttled passengers around in electric cars.

I set out to write an article about these cars, and a time before lawmakers gave deference to the oil industry by offering numerous tax breaks, paving the way for gasoline-powered vehicles. But finding an original E.V. that I could ride in proved difficult. Most of them sit in museums and personal collections.

Enter the comedian — and car collector — Jay Leno. My editor suggested I reach out to Mr. Leno after learning about his 1909 Baker Electric, housed in his famous garage. Mr. Leno's team gave an enthusiastic "Yes" in reply.

When I arrived at his warehouse garage in Burbank, Calif., in April, Mr. Leno had his Baker Electric charged and ready to hit the streets. The 116-year-old car, which had been refurbished, looked like it had just rolled off the showroom floor. Still, the wooden high-top body, 36-inch rubber wheels and Victorian-style upholstery whispered the car's age. It was basically a carriage with batteries, enabling drivers to free horses from their bits and harnesses.


Forbes
Artificial Intelligence Collaboration and Indirect Regulatory Lag
WASHINGTON, DC - MAY 16: Samuel Altman, CEO of OpenAI, testifies before the Senate Judiciary Subcommittee on Privacy, Technology, and the Law, May 16, 2023, in Washington, DC. The committee held an oversight hearing to examine A.I., focusing on rules for artificial intelligence.

Steve Jobs often downplayed his accomplishments by saying that "creativity is just connecting things." Regardless of whether this affects the way you understand his legacy, it is beyond doubt that most innovation comes from interdisciplinary efforts. Everyone agrees that if AI is to exponentially increase collaboration across disciplines, the laws must not lag too far behind technology. The following explores how a less obvious interpretation of this phrase can help us do what Jobs explained was the logic behind his genius.

The Regulatory Lag

What most people mean when they say that legislation and regulation have difficulty keeping pace with innovation is that an innovation and its consequences are not well understood until well after the product hits the market. While that is true, it only tells half of the story. Technological innovations also put more attenuated branches of the law under pressure to adjust. These are second-order, more indirect legal effects, where whole sets of laws—originally unrelated to the new technology—have to adapt to enable society to maximize the full potential of the innovation. One classic example comes from the time right after the Internet became mainstream. After digital communication and connectivity became widespread and expedited international communication and commercial relations, nations discovered that barriers to cross-border trade and investment were getting in the way. Barriers such as tariffs and outdated foreign direct investment (FDI) partnership requirements had to be lowered or eliminated if the Internet was to be an effective catalyst for global economic growth.

Neoliberal Reforms

When the internet emerged in the 1990s, much attention went to laws that directly regulated it—such as data privacy, digital speech, and cybersecurity. But some of the most important legal changes were not about the internet itself. They were about removing indirect legal barriers that stood in the way of its broader economic and social potential. Cross-border trade and investment rules, for instance, had to evolve. Tariffs on goods, restrictions on foreign ownership, and outdated service regulations had little to do with the internet as a technology, but everything to do with whether global e-commerce, remote work, and digital entrepreneurship could flourish. These indirect legal constraints were largely overlooked in early internet governance debates, yet their reform was essential to unleashing the internet's full power.

Artificial Intelligence and Indirect Barriers

A comparable story is starting to unfold with artificial intelligence. While much of the focus in discussions of law and AI has been on algorithmic accountability and data privacy, there is also an opportunity for a larger societal return from AI in its ability to reduce barriers between disciplines. AI is increasing the viability of interdisciplinary work because it can synthesize, translate, and apply knowledge across domains in ways that make cross-field collaboration more essential. Already we are seeing marriages of law and computer science, medicine and machine learning, environmental modeling and language processing.

AI is a general-purpose technology that rewards those who are capable of marrying insights across disciplines. In that sense, the AI era is also an era of interdisciplinary boundary-blurring. But the opportunities triggered by AI are up against legal barriers to entry across disciplines and professions. In many professions, gaining the right to practice or contribute constructively requires navigating a patchwork of licensure regimes and intractable definitions of domain knowledge. While some of these regulations are generally intended to protect public interests, they can also hinder innovation and prevent new interdisciplinary practices from gaining traction. To achieve the full potential of AI-enabled collaboration, many of these legal barriers need to be eliminated—or at least reimagined.

We are starting to see some positive movements. For example, a few states are starting to grant nurse practitioners and physician assistants greater autonomy in clinical decision-making, a step toward cross-disciplinary collaboration between healthcare and AI diagnostics. For now, this is a move in the right direction. However, in some other fields, the professional rules of engagement support silos. This must change if we are going to be serious about enabling AI to help us crack complex, interdependent problems. Legislators and regulators cannot focus exclusively on the bark that protects the tree of change; they must also attend to the hidden network of roots that quietly nourishes and sustains it.