AI Still Struggles to Debug Code, But for How Long?
If you're a programmer worried about AI taking your job, Microsoft's R&D division might have some promising news for you. Microsoft Research tested several top large language models (LLMs) and found that many come up short on a core programming task: debugging.
The study tested nine different models—including Anthropic's Claude 3.7 Sonnet, OpenAI's o1, and OpenAI's o3-mini—and assessed their ability to perform debugging, the time-consuming process whereby programmers sift through existing code to find flaws that prevent it from working as intended. Microsoft hooked up the AIs to a debugging environment of its own creation called Debug Gym and tested them on a common software benchmark, SWE-bench.
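Debug Gym's actual interface isn't spelled out in the article, but the core idea is an agent that can interrogate a live debugger before committing to a fix, rather than patching code from a single read-through. A rough sketch of such a loop might look like the following; every name here (query_llm, run_debugger_command, and so on) is a hypothetical stand-in, not Debug Gym's real API:

```python
# A minimal sketch of an LLM-driven debugging loop, assuming a pdb-style
# tool interface. All helpers are hypothetical stand-ins, not the real
# Debug Gym API.

def query_llm(transcript: list[str]) -> str:
    """Ask the model for its next action, given the session so far."""
    raise NotImplementedError("wire this to your model provider")

def run_debugger_command(command: str) -> str:
    """Run one debugger command (breakpoint, print, step) and capture output."""
    raise NotImplementedError("wire this to a sandboxed debugger")

def debug_episode(bug_report: str, max_steps: int = 20) -> str | None:
    """Let the model gather evidence interactively before proposing a patch."""
    transcript = [f"BUG REPORT:\n{bug_report}"]
    for _ in range(max_steps):
        action = query_llm(transcript)
        if action.startswith("PATCH:"):
            # The model has committed to a fix; hand it off for testing.
            return action.removeprefix("PATCH:")
        # Otherwise treat the action as a debugger command and feed the
        # observed output back, so the next query sees the new evidence.
        output = run_debugger_command(action)
        transcript.append(f"> {action}\n{output}")
    return None  # step budget exhausted without a proposed fix
```

In SWE-bench-style evaluations, a proposed patch is typically scored by running the repository's own test suite against it, which is presumably where the success rates below come from.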
The results were mixed, and none of the tools achieved even a 50% success rate, despite the help of Debug Gym. Anthropic's Claude 3.7 Sonnet was the best performer, successfully debugging the faulty code in 48.4% of cases. OpenAI's o1 succeeded 30.2% of the time, while OpenAI's o3-mini managed 22.1%.
Microsoft says it believes AI tools can become effective code debuggers, but that it first needs "to fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs."
The findings may provide some slight relief for worried programmers, as more of the tech world's largest names pivot toward using AI for coding. In October, Google announced it was using AI to write "a quarter of all new code." Meanwhile, AI startup Cognition Labs rolled out a new AI tool last year, dubbed Devin AI, that it claims can write code without human intervention, complete engineering jobs on Upwork, and adjust its own AI models.
Meta CEO Mark Zuckerberg, meanwhile, told podcaster Joe Rogan that his company will "have an AI that can effectively be a sort of mid-level engineer that you have at your company that can write code" at some point in 2025, and he expects other companies will do the same.
Related Articles


Fast Company
Vibe coding lets anyone write software—but comes with risks
Whether you're streaming a show, paying bills online, or sending an email, each of these actions relies on computer programs that run behind the scenes. The process of writing computer programs is known as coding. Until recently, most computer code was written, at least originally, by human beings. But with the advent of generative artificial intelligence, that has begun to change. Just as you can ask ChatGPT to spin up a recipe for a favorite dish or write a sonnet in the style of Lord Byron, you can now ask generative AI tools to write computer code for you. Andrej Karpathy, an OpenAI co-founder who previously led AI efforts at Tesla, recently termed this 'vibe coding.'

For complete beginners or nontechnical dreamers, writing code based on vibes—feelings rather than explicitly defined information—could feel like a superpower. You don't need to master programming languages or complex data structures. A simple natural language prompt will do the trick.

How it works

Vibe coding leans on standard patterns of technical language, which AI systems use to piece together original code from their training data. Any beginner can use an AI assistant such as GitHub Copilot or Cursor Chat, put in a few prompts, and let the system get to work. Here's an example: 'Create a lively and interactive visual experience that reacts to music, user interaction, or real-time data. Your animation should include smooth transitions and colorful and lively visuals with an engaging flow in the experience. The animation should feel organic and responsive to the music, user interaction, or live data and facilitate an experience that is immersive and captivating. Complete this project using JavaScript or React, and allow for easy customization to set the mood for other experiences.'

But AI tools do this without any real grasp of the specific rules, edge cases, or security requirements of the software in question. This is a far cry from the processes behind developing production-grade software, which must balance trade-offs between product requirements, speed, scalability, sustainability, and security. Skilled engineers write and review the code, run tests, and establish safety barriers before going live.

While skipping that structured process saves time and lowers the skills required to code, there are trade-offs. With vibe coding, most of these stress-testing practices go out the window, leaving systems vulnerable to malicious attacks and leaks of personal data. And there's no easy fix: If you don't understand every—or any—line of code that your AI agent writes, you can't repair the code when it breaks. Or worse, as some experts have pointed out, you won't notice when it's silently failing. The AI itself is not equipped to carry out this analysis either. It recognizes what 'working' code usually looks like, but it cannot necessarily diagnose or fix the deeper problems that code might cause or exacerbate.

Why it matters

Vibe coding could be just a flash-in-the-pan phenomenon that fizzles before long, but it may also find deeper applications with seasoned programmers. The practice could help skilled software engineers and developers turn an idea into a viable prototype more quickly. It could also let novice programmers or even amateur coders experience the power of AI, perhaps motivating them to pursue the discipline more deeply. Vibe coding may also signal a shift that could make natural language a more viable tool for developing some computer programs.
If so, it would echo early website editing systems known as WYSIWYG editors that promised designers 'what you see is what you get,' or 'drag-and-drop' website builders that made it easy for anyone with basic computer skills to launch a blog. For now, I don't believe that vibe coding will replace experienced software engineers, developers, or computer scientists. The discipline and the art are much more nuanced than what AI can handle, and the risks of passing off 'vibe code' as legitimate software are too great. But as AI models improve and become more adept at incorporating context and accounting for risk, practices like vibe coding might cause the boundary between AI and human programmer to blur further.
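To make the 'silently failing' risk described above concrete, here is a toy illustration of my own (not from the article): code that looks right and passes a casual test, yet misbehaves on cases the prompt never mentioned.

```python
# Illustrative only: plausible-looking generated code with two quiet flaws.

def deduplicate(items):
    # Passes a quick eyeball test, but set() silently drops the original
    # order, which matters if the caller expects first-seen ordering.
    return list(set(items))

def average_rating(ratings):
    # Fine on the happy path...
    return sum(ratings) / len(ratings)

print(deduplicate(["b", "a", "b"]))  # order may come back as ['a', 'b']
print(average_rating([4, 5, 3]))     # 4.0, as expected
# average_rating([])                 # ...but an empty list raises
#                                    # ZeroDivisionError, an edge case the
#                                    # original prompt never covered
```

Neither flaw announces itself: the first returns a "correct-looking" result with the wrong ordering, and the second only surfaces when an unanticipated input arrives in production.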
Yahoo
Apple's Siri Could Be More Like ChatGPT. But Is That What You Want?
I've noticed a vibe shift in the appetite for AI on our devices. My social feeds are flooded with disgust over what's being created by Google's AI video generator tool, Veo 3. The unsettlingly realistic video of fake people and voices it creates makes it clear we will have a hard time telling fiction from reality. In other words, the AI slop is looking less sloppy. Meanwhile, the CEO of Anthropic is warning people that AI will wipe out half of all entry-level white-collar jobs. In an interview with Axios, Dario Amodei suggested the government needs to step in to protect us from a mass elimination of jobs that could happen very rapidly.

So as we gear up for Apple's big WWDC presentation on Monday, I have a different view of the headlines highlighting Apple being behind in the AI race. I wonder: What exactly is the flavor of AI that people want or need right now? And will it really matter if Apple keeps waiting to push out its long promised (and long delayed) personalized Siri when people are not feeling optimistic about AI's impact on our society?

In this week's episode of One More Thing, which you can watch embedded above, I go over some of the recent reporting from Bloomberg that discusses leadership changes on the Siri team, and how there are different views on what consumers want out of Siri. Should Apple approach AI in a way that makes Siri a home-grown chatbot, or just make it a better interface for controlling devices? (Maybe a bit of both.)

I expect a lot of griping after WWDC about the state of Siri and Apple's AI, with comparisons to other products like ChatGPT. But I hope we can use those gripes to voice what we really want from the next path for the assistant, by sharing our thoughts and speaking with our wallets. Do you want a Siri that's better at understanding context, or one that goes further and makes decisions for you? It's a question I'll be dwelling on more as Apple gives us the next peek into the future of iOS on Monday, and perhaps a glimpse of how the next Siri is shaping up.

If you're looking for more One More Thing, subscribe to our YouTube page to catch Bridget Carey breaking down the latest Apple news and issues every Friday.

Business Insider
AI leaders have a new term for the fact that their models are not always so intelligent

As academics, independent developers, and the biggest tech companies in the world drive us closer to artificial general intelligence — a still-hypothetical form of intelligence that matches human capabilities — they've hit some roadblocks. Many emerging models are prone to hallucination, misinformation, and simple errors.

Google CEO Sundar Pichai referred to this phase of AI as AJI, or "artificial jagged intelligence," on a recent episode of Lex Fridman's podcast. "I don't know who used it first, maybe Karpathy did," Pichai said, referring to deep learning and computer vision specialist Andrej Karpathy, who cofounded OpenAI before leaving last year. AJI is a bit of a metaphor for the trajectory of AI development — jagged, marked at once by sparks of genius and basic mistakes.

In a 2024 X post titled "Jagged Intelligence," Karpathy described the term as a "word I came up with to describe the (strange, unintuitive) fact that state of the art LLMs can both perform extremely impressive tasks (e.g. solve complex math problems) while simultaneously struggle with some very dumb problems." He then posted examples of state-of-the-art large language models failing to understand that 9.9 is bigger than 9.11, making "non-sensical decisions" in a game of tic-tac-toe, and struggling to count.

The issue is that unlike humans, "where a lot of knowledge and problem-solving capabilities are all highly correlated and improve linearly all together, from birth to adulthood," the jagged edges of AI are not always clear or predictable, Karpathy said.

Pichai echoed the idea. "You see what they can do and then you can trivially find they make numerical errors or counting R's in strawberry or something, which seems to trip up most models," Pichai said. "I feel like we are in the AJI phase where dramatic progress, some things don't work well, but overall, you're seeing lots of progress."

In 2010, when DeepMind launched, its team would talk about a 20-year timeline for AGI, Pichai said. Google acquired DeepMind in 2014. Pichai thinks it'll take a little longer than that, but by 2030, "I would stress it doesn't matter what that definition is because you will have mind-blowing progress on many dimensions." By then, the world will also need a clear system for labeling AI-generated content to "distinguish reality," he said.

"Progress" is a vague term, but Pichai has spoken at length about the benefits we'll see from AI development. At the UN's Summit of the Future in September 2024, he outlined four specific ways AI would advance humanity: improving access to knowledge in native languages, accelerating scientific discovery, mitigating climate disaster, and contributing to economic progress.
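Karpathy's examples suggest a simple way to see the jaggedness for yourself: probe a model with questions that are trivially easy for humans and check where it stumbles. A minimal sketch follows; ask_model is a hypothetical stand-in for whatever chat API you use, and the expected answers come from the failure cases named above.

```python
# A tiny "jaggedness" probe: trivially easy questions that nonetheless
# trip up some state-of-the-art models. ask_model is a hypothetical
# stand-in for a real chat-completion call.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider")

probes = {
    "Which is larger, 9.9 or 9.11? Answer with just the number.": "9.9",
    "How many times does the letter 'r' appear in 'strawberry'?": "3",
}

for prompt, expected in probes.items():
    answer = ask_model(prompt).strip()
    verdict = "ok" if expected in answer else "MISS"
    print(f"[{verdict}] {prompt!r} -> {answer!r}")
```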
As academics, independent developers, and the biggest tech companies in the world drive us closer to artificial general intelligence — a still hypothetical form of intelligence that matches human capabilities — they've hit some roadblocks. Many emerging models are prone to hallucinating, misinformation, and simple errors. Google CEO Sundar Pichai referred to this phase of AI as AJI, or "artificial jagged intelligence," on a recent episode of Lex Fridman's podcast. "I don't know who used it first, maybe Karpathy did," Pichai said, referring to deep learning and computer vision specialist Andrej Karpathy, who cofounded OpenAI before leaving last year. AJI is a bit of a metaphor for the trajectory of AI development — jagged, marked at once by sparks of genius and basic mistakes. In a 2024 X post titled "Jagged Intelligence," Karpathy described the term as a "word I came up with to describe the (strange, unintuitive) fact that state of the art LLMs can both perform extremely impressive tasks (e.g. solve complex math problems) while simultaneously struggle with some very dumb problems." He then posted examples of state of the art large language models failing to understand that 9.9 is bigger than 9.11, making "non-sensical decisions" in a game of tic-tac-toe, and struggling to count. The issue is that unlike humans, "where a lot of knowledge and problem-solving capabilities are all highly correlated and improve linearly all together, from birth to adulthood," the jagged edges of AI are not always clear or predictable, Karpathy said. Pichai echoed the idea. "You see what they can do and then you can trivially find they make numerical errors or counting R's in strawberry or something, which seems to trip up most models," Pichai said. "I feel like we are in the AJI phase where dramatic progress, some things don't work well, but overall, you're seeing lots of progress." In 2010, when Google DeepMind launched, its team would talk about a 20-year timeline for AGI, Pichai said. Google subsequently acquired DeepMind in 2014. Pichai thinks it'll take a little longer than that, but by 2030, "I would stress it doesn't matter what that definition is because you will have mind-blowing progress on many dimensions." By then the world will also need a clear system for labeling AI-generated content to "distinguish reality," he said. "Progress" is a vague term, but Pichai has spoken at length about the benefits we'll see from AI development. At the UN's Summit of the Future in September 2024, he outlined four specific ways that AI would advance humanity — improving access to knowledge in native languages, accelerating scientific discovery, mitigating climate disaster, and contributing to economic progress.