OpenAI Researchers Find That Even the Best AI Is "Unable To Solve the Majority" of Coding Problems

Yahoo, 23-02-2025
OpenAI researchers have admitted that even the most advanced AI models are still no match for human coders, even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.
In a new paper, the company's researchers found that even frontier models, or the most advanced and boundary-pushing AI systems, "are still unable to solve the majority" of coding tasks.
The researchers used a newly-developed benchmark called SWE-Lancer, built on more than 1,400 software engineering tasks from the freelancer site Upwork. Using the benchmark, OpenAI put three large language models (LLMs) — its own o1 reasoning model and flagship GPT-4o, as well as Anthropic's Claude 3.5 Sonnet — to the test.
Specifically, the new benchmark evaluated how well the LLMs performed on two types of tasks from Upwork: individual tasks, which involved resolving bugs and implementing fixes, and management tasks, which had the models zoom out and make higher-level decisions. (The models weren't allowed to access the internet, meaning they couldn't just crib similar answers that had been posted online.)
The models took on tasks cumulatively worth hundreds of thousands of dollars on Upwork, but they were only able to fix surface-level software issues, while remaining unable to actually find bugs in larger projects or find their root causes. These shoddy and half-baked "solutions" are likely familiar to anyone who's worked with AI — which is great at spitting out confident-sounding information that often falls apart on closer inspection.
Though all three LLMs were often able to operate "far faster than a human would," the paper notes, they also failed to grasp how widespread bugs were or to understand their context, "leading to solutions that are incorrect or insufficiently comprehensive."
As the researchers explained, Claude 3.5 Sonnet performed better than the two OpenAI models pitted against it and made more money than o1 and GPT-4o. Still, the majority of its answers were wrong, and according to the researchers, any model would need "higher reliability" to be trusted with real-life coding tasks.
Put more plainly, the paper seems to demonstrate that although these frontier models can work quickly and solve zoomed-in tasks, they are nowhere near as skilled at handling them as human engineers.
Though these LLMs have advanced rapidly over the past few years and will likely continue to do so, they're not skilled enough at software engineering to replace real-life people quite yet — not that that's stopping CEOs from firing their human coders in favor of immature AI models.

Related Articles

OpenAI's Greg Brockman says it's not too late to build AI startups

Business Insider, 15 minutes ago

If you're dreaming of joining the AI startup race, it might not be too late to start.
"Sometimes it might feel like all the ideas are taken, but the economy is so big," Greg Brockman, OpenAI's cofounder and president, said in an episode of the "Latent Space" podcast released on Saturday. "It is worthwhile and really important for people to really think about how do we get the most out of these amazing intelligences that we've created."
Brockman said startups that connect large language models to real-world applications are extremely valuable. Brockman, who cofounded OpenAI in 2015, added that domains like healthcare require founders to think about all the stakeholders and how they can insert AI models into the existing system.
"There is so much fruit that is not yet picked, so go ahead and ride the GPT river," he said.
Brockman also advised founders against building "better wrappers." "AI wrapper" is a dismissive term used to refer to simple applications that are built on top of existing AI models and can be easily offered by LLM companies themselves.
"It's really about understanding a domain and building up expertise and relationships and all of those things," Brockman said.
Brockman's comments are part of a Silicon Valley debate about how new AI founders can future-proof their startup ideas. Last year, OpenAI CEO Sam Altman said his company would "steamroll" any startup building "little things" on top of its model. He said that companies that underestimate the speed of AI model growth risk becoming part of the "OpenAI killed my startup meme."
In a June podcast, Instagram cofounder and Anthropic's chief product officer, Mike Krieger, offered some advice for startups that want to avoid being made obsolete by LLM companies. Startups with deep knowledge in areas like law or biotechnology and those with good customer relationships can survive AI giants, Krieger said. He also suggested that startups play with new AI interfaces that feel "very weird" at first.
"I don't envy them," he added, about founders wanting to build in the AI space. "Maybe that's part of the reason why I wanted to join a company rather than start one."

Apple faces Musk's wrath in explosive AI bias allegations

Yahoo, an hour ago

Apple faces Musk's wrath in explosive AI bias allegations originally appeared on TheStreet.
Elon Musk is threatening to drag Apple into court, accusing the tech giant of rigging its App Store to favor OpenAI's ChatGPT over his own AI chatbot, Grok.
In a flurry of posts on X, Musk claimed Apple is making it "impossible for any AI company besides Open AI to reach #1 in the App Store," calling the practice an "unequivocal antitrust violation." Musk added that his AI startup, xAI, will take "immediate legal action" unless Apple changes course. Apple has so far declined to comment on Musk's allegations.
The Grok vs ChatGPT battle
Musk's frustration centers on Grok's exclusion from Apple's "Must-Have" section, even though Grok is ranked fifth among all free apps in the U.S. App Store. His social media platform X, which he claims is the "#1 news app in the world," is also missing from the "Must-Have" list.
Meanwhile, OpenAI's ChatGPT sits at No. 1 in the free app rankings and enjoys prime placement in the "Must-Have" section. Apple is also prominently promoting ChatGPT-5, OpenAI's newest AI model, at the top of its "Apps" page.
"Are you playing politics?" Musk asked in one X post.
Musk's legal threat comes shortly after Grok leapfrogged Google's AI app in Apple's App Store rankings following the launch of its new Grok 4 chatbot last month. The news also follows the recent departure of xAI legal head Robert Keele, who announced last week that he had left the company to spend more time with family. Keele also acknowledged there was "daylight between our worldviews," adding that Musk's "vision, commitment, and smarts blew me away on the daily."
Musk's longstanding feud with OpenAI
Musk co-founded OpenAI in 2015, but left its board in 2018. He has become one of its most vocal critics since then, suing the Microsoft-backed company and its CEO Sam Altman for allegedly abandoning its non-profit mission for developing AI "for the benefit of humanity broadly."
Musk had previously warned Apple not to integrate OpenAI's technology at the operating system level, threatening to ban Apple devices from his companies if it did.
While Apple has remained silent on the matter, Altman responded in his own X post, stating: "This is a remarkable claim given what I have heard alleged that Elon does to manipulate X to benefit himself and his own companies and harm his competitors and people he doesn't like."
Continued antitrust scrutiny for Apple
Apple is no stranger to antitrust challenges. The U.S. Department of Justice sued the company in 2024 for maintaining an alleged iPhone ecosystem monopoly. In 2021, courts forced Apple to loosen App Store payment restrictions in a major win for gaming companies after a judge maintained its fees were anticompetitive.
If Musk follows through with his latest legal threat, his lawsuit could become another high-profile test of Apple's control over app distribution and promotion, this time in the booming AI industry. Whether Musk is simply posturing or preparing for an "immediate" legal blitz, the fight could set a new precedent for how app stores handle AI.
This story was originally reported by TheStreet on Aug 13, 2025, where it first appeared.

White House AI czar David Sacks says 'AI psychosis' is similar to the 'moral panic' of social media's early days

Yahoo, 6 hours ago

The White House AI advisor discussed "AI psychosis" on a recent podcast. David Sacks said he doubted the validity of the concept. He compared it to the "moral panic" that surrounded earlier tech leaps, like social media.
AI can create a diet plan, organize a calendar, and provide answers to an endless variety of burning questions. Can it also cause a psychiatric breakdown?
David Sacks, the White House official spearheading America's AI policies, doesn't think so. President Donald Trump's AI and crypto czar discussed "AI psychosis" during an episode of the "All-In Podcast" published Friday.
While most people engage with chatbots without a problem, a small number of users say the bots have encouraged delusions and other concerning behavior. For some, ChatGPT serves as an alternative to professional therapists.
A psychiatrist earlier told Business Insider that some of his patients exhibiting what's been described as "AI psychosis," a nonclinical term, used the technology before experiencing mental health issues, "but they turned to it in the wrong place at the wrong time, and it supercharged some of their vulnerabilities."
During the podcast, Sacks doubted the whole concept of "AI psychosis."
"I mean, what are we talking about here? People doing too much research?" he asked. "This feels like the moral panic that was created over social media, but updated for AI."
Sacks then referred to a recent article featuring a psychiatrist, who said they didn't believe using a chatbot inherently induced "AI psychosis" if there aren't other risk factors, including social and genetic ones, involved.
"In other words, this is just a manifestation or outlet for pre-existing problems," Sacks said. "I think it's fair to say we're in the midst of a mental health crisis in this country."
Sacks attributed the crisis instead to the COVID-19 pandemic and related lockdowns. "That's what seems to have triggered a lot of these mental health declines," he said.
After several reports of users suffering mental breaks while using ChatGPT, OpenAI CEO Sam Altman addressed the issue on X after the company rolled out the highly anticipated GPT-5.
"People have used technology, including AI, in self-destructive ways; if a user is in a mentally fragile state and prone to delusion, we do not want the AI to reinforce that," Altman wrote. "Most users can keep a clear line between reality and fiction or role-play, but a small percentage cannot."
Earlier this month, OpenAI introduced safeguards in ChatGPT, including a prompt encouraging users to take breaks after long conversations with the chatbot. The update will also change how the chatbot responds to users asking about personal challenges.
Read the original article on Business Insider.
