xAI hired gig workers to boost Grok on a key AI leaderboard and 'beat' Anthropic's Claude in coding

Tech companies are fiercely competing to build the best AI coding tools — and for xAI, the top rival to beat seems to be Anthropic.
Elon Musk's AI company used contractors to train Grok on coding tasks with the goal of topping a popular AI leaderboard, and explicitly told them it wanted the model to outperform Anthropic's Claude 3.7 Sonnet, documents obtained by Business Insider show.
The contractors, hired through Scale AI's Outlier platform, were assigned a project to "hillclimb" Grok's ranking on WebDev Arena, an influential leaderboard from LMArena that pits AI models against each other in web development challenges, with users voting for the winner.
"We want to make the in-task model the #1 model" for LMArena, reads one Scale AI onboarding doc that was active in early July, according to one contractor who worked on the project. Contractors were told to generate and refine front-end code for user interface prompts to "beat Sonnet 3.7 Extended," a reference to Anthropic's Claude model.
xAI did not reply to a BI request for comment.
In the absence of universally agreed-upon standards, leaderboard rankings and benchmark scores have become the AI industry's unofficial scoreboard.
For labs like OpenAI and Anthropic, topping these rankings can help attract funding, new customers, lucrative contracts, and media attention.
Anthropic's Claude, which has multiple models, is considered one of the leading players for AI coding and consistently ranks near the top of many leaderboards, often alongside Google and OpenAI.
Anthropic cofounder Ben Mann said on the "No Priors" podcast last month that other companies had declared "code reds" to try to match Claude's coding abilities, and he was surprised that other models hadn't caught up. Competitors like Meta are using Anthropic's coding tools internally, BI previously reported.
The Scale AI dashboard and project instructions did not specify which version of Grok the project was training, though it was in use days before the newest model, Grok 4, came out on July 9.
On Tuesday, LMArena ranked Grok 4 in 12th place for web development. Models from Anthropic ranked in joint first, third, and fourth.
The day after Grok 4's launch, Musk posted on X claiming that the new model "works better than Cursor" at fixing code, referring to the popular AI-assisted developer tool.
You can cut & paste your entire source code file into the query entry box on https://t.co/EqiIFyHFlo and @Grok 4 will fix it for you!
This is what everyone @xAI does. Works better than Cursor.
— Elon Musk (@elonmusk) July 10, 2025
In a comment to BI, Scale AI said it does not overfit models by training them directly on a test set. The company said it never copies or reuses public benchmark data for large language model training and told BI it was engaging in a "standard data generation project using public signals to close known performance gaps."
Anastasios Angelopoulos, the CEO of LMArena, told BI that while he wasn't aware of the specific Scale project, hiring contractors to help AI models climb public leaderboards is standard industry practice.
"This is part of the standard workflow of model training. You need to collect data to improve your model," Angelopoulos said, adding that it's "not just to do well in web development, but in any benchmark."
The race for leaderboard dominance
The industry's focus on AI leaderboards can drive intense — and not always fair — competition.
Sara Hooker, the head of Cohere Labs and one of the authors of "The Leaderboard Illusion," a paper published by researchers from universities including MIT and Stanford, told BI that "when a leaderboard is important to a whole ecosystem, the incentives are aligned for it to be gamed."
In April, after Meta's Llama 4 model shot up to second place on LMArena, developers noticed that the model variant that Meta used for public benchmarking was different from the version released to the public. This sparked accusations from AI researchers that Meta was gaming the leaderboard.
Meta denied the claims, saying the variant in question was experimental and that evaluating multiple versions of a model is standard practice.
Although xAI's project with Scale AI asked contractors to help "hillclimb" the LMArena rankings, there is no evidence that they were gaming the leaderboard.
Leaderboard dominance doesn't always translate into real-world ability. Shivalika Singh, another author of the paper, told BI that "doing well on the Arena doesn't result in generally good performance" or guarantee strong results on other benchmarks.
Overall, Grok 4 ranked in the top three for LMArena's core categories of math, coding, and "Hard Prompts."
However, early data from Yupp, a new crowdsourced leaderboard and LMArena rival, showed that Grok 4 ranked 66 out of more than 100 models, highlighting the variance between leaderboards.
Nate Jones, an AI strategist and product leader with a widely read newsletter, said he found Grok's actual abilities often lagged behind its leaderboard hype.
"Grok 4 crushed some flashy benchmarks, but when the rubber met the road in my tests this week Grok 4 stumbled hard," he wrote in his Substack on Monday. "The moment we set leaderboard dominance as the goal, we risk creating models that excel in trivial exercises and flounder when facing reality."