Latest news with #DebugGym

AI Still Struggles to Debug Code, But for How Long?

Yahoo

12-04-2025

Business

If you're a programmer who is scared about AI taking your job, Microsoft's R&D division might have some reassuring news for you. Microsoft Research tested several top large language models (LLMs) and found that many come up short on common programming tasks.

The study tested nine different models, including Anthropic's Claude 3.7 Sonnet, OpenAI's o1, and OpenAI's o3-mini, and assessed their ability to perform 'debugging,' the time-consuming process whereby programmers sift through existing code to find flaws that prevent it from working as intended. Microsoft hooked up the AIs to a debugging assistant it created called Debug Gym and tested them on a common software benchmark, SWE-bench.

The results were mixed: none of the models reached a 50% success rate, even with the help of Debug Gym. Anthropic's Claude 3.7 Sonnet was the best performer, successfully debugging the faulty code in 48.4% of cases. OpenAI's o1 succeeded 30.2% of the time, while OpenAI's o3-mini did so 22.1% of the time.

Microsoft says it believes the AI tools can become effective code debuggers, but that it first needs "to fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs."

The findings may provide some slight relief for worried programmers, as more of the tech world's largest names pivot toward using AI for coding. In October, Google announced it was using AI to write "a quarter of all new code." AI startup Cognition Labs rolled out a new AI tool last year, dubbed Devin AI, that it claims can write code without human intervention, complete engineering jobs on Upwork, and adjust its own AI models.

Meta CEO Mark Zuckerberg, meanwhile, told podcaster Joe Rogan that his company will "have an AI that can effectively be a sort of mid-level engineer that you have at your company that can write code" at some point in 2025, and he expects other companies will do the same.
