
Latest news with #Claude3.7Sonnet

xAI hired gig workers to boost Grok on a key AI leaderboard and 'beat' Anthropic's Claude in coding

Business Insider

6 hours ago

  • Business
  • Business Insider

xAI hired gig workers to boost Grok on a key AI leaderboard and 'beat' Anthropic's Claude in coding

Tech companies are fiercely competing to build the best AI coding tools — and for xAI, the top rival to beat seems to be Anthropic. Elon Musk's AI company used contractors to train Grok on coding tasks with the goal of topping a popular AI leaderboard, and explicitly told them it wanted the model to outperform Anthropic's Claude 3.7 Sonnet, documents obtained by Business Insider show.

The contractors, hired through Scale AI's Outlier platform, were assigned a project to "hillclimb" Grok's ranking on WebDev Arena, an influential leaderboard from LMArena that pits AI models against each other in web development challenges, with users voting for the winner. "We want to make the in-task model the #1 model" for LMArena, reads one Scale AI onboarding doc that was active in early July, according to one contractor who worked on the project. Contractors were told to generate and refine front-end code for user interface prompts to "beat Sonnet 3.7 Extended," a reference to Anthropic's Claude model. xAI did not reply to a BI request for comment.

In the absence of universally agreed-upon standards, leaderboard rankings and benchmark scores have become the AI industry's unofficial scoreboard. For labs like OpenAI and Anthropic, topping these rankings can help attract funding, new customers, lucrative contracts, and media attention. Anthropic's Claude, which has multiple models, is considered one of the leading players in AI coding and consistently ranks near the top of many leaderboards, often alongside Google and OpenAI. Anthropic cofounder Ben Mann said on the "No Priors" podcast last month that other companies had declared "code reds" to try to match Claude's coding abilities, and that he was surprised other models hadn't caught up. Competitors like Meta are using Anthropic's coding tools internally, BI previously reported.

The Scale AI dashboard and project instructions did not specify which version of Grok the project was training, though it was in use days before the newest model, Grok 4, came out on July 9. On Tuesday, LMArena ranked Grok 4 in 12th place for web development. Models from Anthropic ranked joint first, third, and fourth. The day after Grok 4's launch, Musk posted on X claiming that the new model "works better than Cursor" at fixing code, referring to the popular AI-assisted developer tool.

You can cut & paste your entire source code file into the query entry box on and @Grok 4 will fix it for you! This is what everyone @xAI does. Works better than Cursor. — Elon Musk (@elonmusk) July 10, 2025

In a comment to BI, Scale AI said it does not overfit models by training them directly on a test set. The company said it never copies or reuses public benchmark data for large language model training, and told BI it was engaging in a "standard data generation project using public signals to close known performance gaps."

Anastasios Angelopoulos, the CEO of LMArena, told BI that while he wasn't aware of the specific Scale project, hiring contractors to help AI models climb public leaderboards is standard industry practice. "This is part of the standard workflow of model training. You need to collect data to improve your model," Angelopoulos said, adding that it's "not just to do well in web development, but in any benchmark."

The race for leaderboard dominance

The industry's focus on AI leaderboards can drive intense — and not always fair — competition. Sara Hooker, the head of Cohere Labs and one of the authors of "The Leaderboard Illusion," a paper published by researchers from universities including MIT and Stanford, told BI that "when a leaderboard is important to a whole ecosystem, the incentives are aligned for it to be gamed."

In April, after Meta's Llama 4 model shot up to second place on LMArena, developers noticed that the model variant Meta used for public benchmarking was different from the version released to the public. This sparked accusations from AI researchers that Meta was gaming the leaderboard. Meta denied the claims, saying the variant in question was experimental and that evaluating multiple versions of a model is standard practice. Although xAI's project with Scale AI asked contractors to help "hillclimb" the LMArena rankings, there is no evidence that they were gaming the leaderboard.

Leaderboard dominance doesn't always translate into real-world ability. Shivalika Singh, another author of the paper, told BI that "doing well on the Arena doesn't result in generally good performance" or guarantee strong results on other benchmarks. Overall, Grok 4 ranked in the top three for LMArena's core categories of math, coding, and "Hard Prompts." However, early data from Yupp, a new crowdsourced leaderboard and LMArena rival, showed Grok 4 ranked 66th out of more than 100 models, highlighting the variance between leaderboards.

Nate Jones, an AI strategist and product leader with a widely read newsletter, said he found Grok's actual abilities often lagged behind its leaderboard hype. "Grok 4 crushed some flashy benchmarks, but when the rubber met the road in my tests this week Grok 4 stumbled hard," he wrote in his Substack on Monday. "The moment we set leaderboard dominance as the goal, we risk creating models that excel in trivial exercises and flounder when facing reality."
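For background on how arena-style leaderboards such as WebDev Arena turn crowdsourced votes into a ranking: each vote is a pairwise preference between two anonymous models, and those preferences are aggregated into ratings (LMArena has described a Bradley-Terry-style statistical approach). The sketch below uses a simple Elo-style update purely as an illustration of that idea; the constant, model names, and vote data are assumptions, not LMArena's actual methodology or code.

```python
# Minimal sketch: turning pairwise human votes into a leaderboard ranking.
# This Elo-style update only illustrates the general idea behind arena-style
# leaderboards; it is not LMArena's actual implementation.

from collections import defaultdict

K = 32  # update step size (assumed value, for illustration only)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under a logistic (Elo) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one pairwise vote to the ratings table."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in votes:            # hypothetical battles and vote outcomes
    update(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)
```

With many such votes, models that win more of their head-to-head matchups accumulate higher ratings, which is what a "hillclimb" project would be trying to influence by improving the model's win rate on the prompts users submit.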

Tracking AI models' 'thoughts' could reveal how they make decisions, researchers say

Indian Express

a day ago

  • Science
  • Indian Express

Tracking AI models' 'thoughts' could reveal how they make decisions, researchers say

A broad coalition drawn from the ranks of multiple AI companies, universities, and non-profit organisations has called for deeper scrutiny of AI reasoning models, particularly their 'thoughts' or reasoning traces. In a new position paper published on Tuesday, July 15, the authors said that monitoring the chains-of-thought (CoT) produced by AI reasoning models could be pivotal to keeping AI agents in check.

Reasoning models such as OpenAI's o3 differ from large language models (LLMs) such as GPT-4o in that they are said to follow an externalised process, working out the problem step by step before generating an answer, according to a report by TechCrunch. Reasoning models can be used to perform tasks such as solving complex math and science problems. They also serve as the underlying technology for AI agents capable of autonomously accessing the internet, visiting websites, making hotel reservations, and so on, on behalf of users.

This push to advance AI safety research could help shed light on how AI reasoning models work, an area that remains poorly understood even though these models have reportedly improved the overall performance of AI on benchmarks.

'CoT monitoring presents a valuable addition to safety measures for frontier AI, offering a rare glimpse into how AI agents make decisions,' the paper reads. 'Yet, there is no guarantee that the current degree of visibility will persist. We encourage the research community and frontier AI developers to make the best use of CoT monitorability and study how it can be preserved,' it adds.

The paper calls on leading AI model developers to determine whether CoT reasoning is 'monitorable' and to track its monitorability. It urges deeper research into the factors that could shed more light on how these AI models arrive at answers. AI developers should also look into whether CoT reasoning can be used as a safeguard to prevent AI-related harms, as per the document. But the paper carries a cautionary note as well: it suggests that any interventions should not make AI reasoning models less transparent or reliable.

In September last year, OpenAI released a preview of its first AI reasoning model, o1. The launch prompted other companies to release competing models with similar capabilities, such as Gemini 2.0, Claude 3.7 Sonnet, and xAI's Grok 3, among others. Anthropic researchers have been studying AI reasoning models, with a recent academic study suggesting that AI models can fake CoT reasoning. Another research paper from OpenAI found that CoT monitoring could enable better alignment of AI models with human behaviour and values.
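To make the idea of CoT monitoring concrete, the sketch below shows one hypothetical form it could take: a lightweight check that reads an agent's reasoning trace and holds back the proposed action if the trace matches red-flag patterns. The function names, patterns, and wiring are illustrative assumptions, not an implementation from the position paper or any lab.

```python
# Hypothetical sketch of chain-of-thought (CoT) monitoring: a separate check
# reads the model's intermediate reasoning trace and flags suspicious intent
# before the agent's proposed action is allowed to run. All names and patterns
# here are illustrative assumptions.

import re

RED_FLAGS = [
    r"bypass (the )?safety",
    r"hide (this|my) (step|intent)",
    r"exfiltrate",
    r"ignore (the )?user'?s instruction",
]

def monitor_cot(reasoning_trace: str) -> list[str]:
    """Return the red-flag patterns found in a reasoning trace, if any."""
    return [p for p in RED_FLAGS if re.search(p, reasoning_trace, re.IGNORECASE)]

def guarded_act(reasoning_trace: str, proposed_action: str) -> str:
    """Only allow the agent's action when the trace raises no flags."""
    hits = monitor_cot(reasoning_trace)
    if hits:
        return f"BLOCKED for human review (matched: {hits})"
    return f"EXECUTING: {proposed_action}"

trace = "Plan: book the hotel the user asked for, then confirm by email."
print(guarded_act(trace, "book_hotel(city='Paris', nights=2)"))
```

The paper's caution also applies to a scheme like this: if models learn to phrase their traces to evade the monitor, or if training makes the traces less faithful to the model's actual computation, the visibility the monitor relies on disappears.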

Apple Paper questions path to AGI, sparks division in GenAI group

Time of India

14-06-2025

  • Science
  • Time of India

Apple Paper questions path to AGI, sparks division in GenAI group

New Delhi: A recent research paper from Apple focusing on the limitations of large reasoning models in artificial intelligence has left the generative AI community divided, sparking significant debate over whether the current path AI companies are taking towards artificial general intelligence is the right one.

What did Apple find?

The paper, titled The Illusion of Thinking and published earlier this week, demonstrates that even the most sophisticated large reasoning models do not genuinely think or reason in a human-like way. Instead, they excel at pattern recognition and mimicry, generating responses that only appear intelligent but lack true comprehension or conceptual understanding. The study used controlled puzzle environments, such as the popular Tower of Hanoi puzzle, to systematically test the reasoning abilities of large reasoning models such as OpenAI's o3-mini, DeepSeek's R1, Anthropic's Claude 3.7 Sonnet and Google's Gemini Flash across varying complexities. The findings show that while large reasoning and language models may handle simple or moderately complex tasks, they suffer total failure on high-complexity problems, even when given sufficient computational resources.

Widespread support for Apple's findings

Gary Marcus, a cognitive scientist and a known sceptic of the claims surrounding large language models, views Apple's work as compelling empirical evidence that today's models primarily repeat patterns learned during training on vast datasets, without genuine understanding or true reasoning capabilities. "If you can't use a billion-dollar AI system to solve a problem that Herb Simon (one of the actual godfathers of AI, current hype aside) solved with AI in 1957, and that first semester AI students solve routinely, the chances that models like Claude or o3 are going to reach AGI seem truly remote," Marcus wrote in his blog. Marcus' arguments echo earlier comments by Meta's chief AI scientist Yann LeCun, who has argued that current AI systems are mainly sophisticated pattern-recognition tools rather than true thinkers. The release of Apple's paper ignited a polarised debate across the broader AI community, with many panning the design of the study rather than its findings.

On the other hand...

A published critique of the paper by researchers from Anthropic and San Francisco-based Open Philanthropy said the study has issues in its experimental design and overlooks output limits. In an alternative demonstration, the researchers tested the models on the same problems but allowed them to use code, resulting in high accuracy across all the tested models. The criticism that the study ignores output limits and the models' ability to write code has also been raised by other AI commentators and researchers, including Matthew Berman. "SOTA models failed The Tower of Hanoi puzzle at a complexity threshold of >8 discs when using natural language alone to solve it. However, ask it to write code to solve it, and it flawlessly does up to seemingly unlimited complexity," Berman wrote in a post on X (formerly Twitter).

Industry impact

The study highlights Apple's more cautious approach to AI compared with rivals like Google and Samsung, which have aggressively integrated AI into their products. Apple's research helps explain its hesitancy to fully commit to AI, contrasting with the industry's prevailing narrative of rapid progress.

Many questioned the timing of the study's release, which coincided with Apple's annual WWDC event, where it announces its next software updates. Chatter across online forums suggested the study was more about managing expectations in light of Apple's own struggles with AI. That said, practitioners and business users argue that the findings do not change the immediate utility of AI tools for everyday applications.
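For context on Berman's point above: Tower of Hanoi is exactly the kind of problem that is tedious to narrate move by move but trivial to solve programmatically. The textbook recursive algorithm below handles any disc count; it is the classic solution, not code produced by the models in the Apple study.

```python
# The Tower of Hanoi puzzle that models reportedly fail past ~8 discs when
# reasoning in natural language has a short, exact recursive solution in code.

def hanoi(n: int, src: str, dst: str, aux: str, moves: list) -> None:
    """Append the optimal move sequence for n discs from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, aux, dst, moves)   # clear the way for the largest disc
    moves.append((src, dst))             # move the largest disc
    hanoi(n - 1, aux, dst, src, moves)   # stack the rest back on top

moves: list = []
hanoi(10, "A", "C", "B", moves)          # 10 discs: past the reported failure point
print(len(moves))                        # 1023 moves, i.e. 2**10 - 1
```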

Apple researchers find 'major' flaws in AI reasoning models ahead of WWDC 2025

Time of India

09-06-2025

  • Science
  • Time of India

Apple researchers find 'major' flaws in AI reasoning models ahead of WWDC 2025

A newly published Apple Machine Learning Research study has challenged the prevailing idea that large language models (LLMs) like OpenAI's o1 and Claude's 'thinking' variants truly possess "reasoning" capabilities, pointing to fundamental limitations in these AI systems.

For the study, Apple researchers designed controllable puzzle environments, such as the Tower of Hanoi and the River Crossing, instead of standard math benchmarks, which are susceptible to data contamination. According to the researchers, these custom environments allowed for a precise analysis of both the final answers produced by the LLMs and their internal reasoning traces across different complexity levels.

What Apple researchers found in the study

According to a report by MacRumors, the reasoning models tested by Apple's research team, including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet, saw their accuracy collapse entirely once problem complexity crossed certain thresholds. Success rates dropped to zero even though the models had sufficient computational resources. Surprisingly, as problems became harder, the models reduced their reasoning effort, which points to fundamental scaling limitations rather than a lack of resources. Even more revealing, the models still failed at the same complexity points even when researchers provided complete solution algorithms, indicating that the limitation lies in executing basic logical steps, not in choosing the right problem-solving strategy. The models also showed puzzling inconsistencies: they were able to solve problems requiring over 100 moves but failed on simpler puzzles that needed only 11 moves.

The study identified three performance patterns. Standard models unexpectedly performed better than reasoning models on low-complexity problems. Reasoning models had an advantage at medium complexity. Both types failed at high complexity. The researchers also found that models exhibited inefficient "overthinking" patterns, often discovering correct solutions early but wasting computational effort exploring incorrect alternatives.

The key takeaway is that current "reasoning" models rely heavily on advanced pattern matching, not true reasoning. These models do not scale their reasoning the way humans do: they tend to overthink easy problems and think less when faced with harder ones.

It is worth noting that this research surfaced just days before WWDC 2025. According to Bloomberg, Apple is expected to focus on new software designs rather than headline-grabbing AI features at this year's event.
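To illustrate why controlled puzzle environments allow such precise analysis: a Tower of Hanoi answer can be verified mechanically at any disc count, and the minimum solution length grows as 2^n - 1 moves, so each added disc roughly doubles the work. The checker below is an illustrative sketch, assuming a simple move-list output format; it is not Apple's actual evaluation harness.

```python
# Hedged sketch: a proposed Tower of Hanoi move sequence can be checked
# mechanically for legality and completion at any disc count, which is what
# makes puzzle environments easy to score exactly. Illustrative only.

def check_hanoi(n: int, moves: list[tuple[str, str]]) -> bool:
    """Return True if `moves` legally transfers n discs from peg A to peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom-to-top
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disc = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disc:
            return False                       # larger disc placed on smaller
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all discs on the target peg

# Minimum moves grow exponentially (2**n - 1): a threshold of >8 discs already
# means a sequence of 255+ moves that must be entirely correct.
print([2**n - 1 for n in range(1, 11)])
```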
