Crowdsourced AI benchmarks have serious flaws, some experts say
AI labs are increasingly relying on crowdsourced benchmarking platforms such as Chatbot Arena to probe the strengths and weaknesses of their latest models. But some experts say that there are serious problems with this approach from an ethical and academic perspective.
Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate upcoming models' capabilities. When a model scores favorably, the lab behind it will often tout that score as evidence of a meaningful improvement.
It's a flawed approach, however, according to Emily Bender, a University of Washington linguistics professor and co-author of the book "The AI Con." Bender takes particular issue with Chatbot Arena, which tasks volunteers with prompting two anonymous models and selecting the response they prefer.
"To be valid, a benchmark needs to measure something specific, and it needs to have construct validity — that is, there has to be evidence that the construct of interest is well-defined and that the measurements actually relate to the construct," Bender said. "Chatbot Arena hasn't shown that voting for one output over another actually correlates with preferences, however they may be defined."
Asmelash Teka Hadgu, the co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, said that he thinks benchmarks like Chatbot Arena are being "co-opted" by AI labs to "promote exaggerated claims." Hadgu pointed to a recent controversy involving Meta's Llama 4 Maverick model. Meta fine-tuned a version of Maverick to score well on Chatbot Arena, only to withhold that model in favor of releasing a worse-performing version.
"Benchmarks should be dynamic rather than static data sets," Hadgu said, "distributed across multiple independent entities, such as organizations or universities, and tailored specifically to distinct use cases, like education, healthcare, and other fields done by practicing professionals who use these [models] for work."
Hadgu and Kristine Gloria, who formerly led the Aspen Institute's Emergent and Intelligent Technologies Initiative, also made the case that model evaluators should be compensated for their work. Gloria said that AI labs should learn from the mistakes of the data labeling industry, which is notorious for its exploitative practices. (Some labs have been accused of the same.)
"In general, the crowdsourced benchmarking process is valuable and reminds me of citizen science initiatives," Gloria said. "Ideally, it helps bring in additional perspectives to provide some depth in both the evaluation and fine-tuning of data. But benchmarks should never be the only metric for evaluation. With the industry and the innovation moving quickly, benchmarks can rapidly become unreliable."
Matt Frederikson, the CEO of Gray Swan AI, which runs crowdsourced red teaming campaigns for models, said that volunteers are drawn to Gray Swan's platform for a range of reasons, including "learning and practicing new skills." (Gray Swan also awards cash prizes for some tests.) Still, he acknowledged that public benchmarks "aren't a substitute" for "paid private" evaluations.
"[D]evelopers also need to rely on internal benchmarks, algorithmic red teams, and contracted red teamers who can take a more open-ended approach or bring specific domain expertise," Frederikson said. "It's important for both model developers and benchmark creators, crowdsourced or otherwise, to communicate results clearly to those who follow, and be responsive when they are called into question."
Alex Atallah, the CEO of model marketplace OpenRouter, which recently partnered with OpenAI to grant users early access to OpenAI's GPT-4.1 models, said open testing and benchmarking of models alone "isn't sufficient." So did Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of LMArena, which maintains Chatbot Arena.
"We certainly support the use of other tests," Chiang said. "Our goal is to create a trustworthy, open space that measures our community's preferences about different AI models."
Chiang said that incidents such as the Maverick benchmark discrepancy aren't the result of a flaw in Chatbot Arena's design, but rather labs misinterpreting its policy. LMArena has taken steps to prevent future discrepancies from occurring, Chiang said, including updating its policies to "reinforce our commitment to fair, reproducible evaluations."
"Our community isn't here as volunteers or model testers," Chiang said. "People use LMArena because we give them an open, transparent place to engage with AI and give collective feedback. As long as the leaderboard faithfully reflects the community's voice, we welcome it being shared."