Latest news with #SWE-bench


Hans India
23-05-2025
- Business
- Hans India
Anthropic Launches Claude 4 Sonnet and Opus: Smarter AI Models for Coding and Complex Reasoning
Anthropic, the AI research company founded by former OpenAI employees, has officially launched two advanced AI models — Claude 4 Sonnet and Claude 4 Opus — at its first-ever developer conference. Designed to push the limits of coding, reasoning, and long-term task performance, the company claims that Opus 4 is now the most powerful AI coding model in the world. 'Claude Opus 4 is the world's best coding model, with sustained performance on complex, long-running tasks and agent workflows,' the company stated in its official blog post.

These models represent Anthropic's next major leap in AI development. Claude Opus 4 is tailored for developers managing intensive, time-consuming tasks and large-scale workflows. Claude Sonnet 4, meanwhile, is a refined, faster alternative to its predecessor, Claude Sonnet 3.7, offering solid performance in more general-use scenarios — and it's now accessible to free-tier users. In contrast, access to Claude Opus 4 is limited to paid subscribers.

Anthropic is placing a strong emphasis on the models' ability to reason and handle complex scenarios with consistency and clarity. Opus 4 scored an impressive 72.5% on SWE-bench and 43.2% on Terminal-bench — both respected benchmarks for coding tasks. These scores reflect the model's capacity to maintain high performance over extended periods, an essential trait for long-duration software projects and AI agent functions. Sonnet 4, while positioned below Opus in overall capability, performs commendably with a 72.7% score on SWE-bench. According to Anthropic, the Sonnet model delivers a strong balance between speed and accuracy, making it suitable for users who need efficiency without sacrificing output quality.

One standout feature across both models is their ability to engage in what Anthropic describes as 'extended thinking.' This functionality allows the models to pause their reasoning, use external tools like code execution or web search, and then resume their task with added context and depth. Tool usage can also occur in parallel, enhancing productivity in more intricate workflows.

The Claude 4 models are also designed to remember. If granted access to local files, they can extract important facts, store them, and recall them in future interactions — a big step forward in developing AI with useful long-term memory. Anthropic showcased this by having Opus 4 access files during a game of Pokémon, where it successfully created a 'Navigation Guide' while keeping track of its prior actions.

Anthropic has also rolled out four new API features alongside the model launch: a code execution tool, a connector for the Model Context Protocol (MCP), a Files API, and one-hour prompt caching. These updates aim to make it easier for developers to create more responsive and intelligent AI agents. To further demystify AI decision-making, the company has introduced a new 'thinking summaries' feature. Generated by a smaller AI model, these provide short, digestible explanations of the model's reasoning. For those wanting a deeper look, full reasoning chains are still available in Developer Mode.

With these launches, Anthropic is not just expanding its model lineup — it's staking a bold claim in the future of AI development, focusing on deeper reasoning, better tools, and models that think more like humans.
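The 'extended thinking' capability described above is exposed through Anthropic's Messages API. The snippet below is a minimal sketch, assuming the anthropic Python SDK; the model identifier and token budgets are illustrative placeholders rather than values confirmed in the article.

```python
# A minimal sketch of enabling "extended thinking" on a Claude 4 request,
# assuming the `anthropic` Python SDK. The model identifier and token budgets
# below are illustrative placeholders, not values taken from the article.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",   # assumed model identifier
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # reserve tokens for reasoning
    messages=[
        {"role": "user", "content": "Plan a refactor of this module and explain the trade-offs."}
    ],
)

# The response interleaves reasoning ("thinking") blocks with ordinary text blocks.
for block in response.content:
    if block.type == "thinking":
        print("[reasoning]", block.thinking[:200])
    elif block.type == "text":
        print(block.text)
```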


Time of India
23-05-2025
- Business
- Time of India
Anthropic launches Claude Opus 4 and Claude Sonnet 4 AI models
Anthropic has launched its next-generation AI models, Claude Opus 4 and Claude Sonnet 4, positioning them as major advancements in AI coding, reasoning, and agentic capabilities. The release is part of the broader Claude 4 update, which also includes new developer tools, improved memory functions, and enhanced agent workflows.

Both models introduce 'extended thinking with tool use' in beta, allowing Claude to alternate between internal reasoning and external tools like web search. They also support parallel tool execution, improved memory handling, and enhanced instruction-following. When granted file access, Opus 4 can create and reference memory files, increasing contextual understanding and long-term coherence.

Claude Opus 4 is being touted by the company as the most powerful coding model to date, citing scores of 72.5% on SWE-bench and 43.2% on Terminal-bench. The AI model is said to be capable of sustained performance across complex, multi-step tasks. The model powers long-running workflows and supports agent applications that require persistent focus and reasoning.

Claude Sonnet 4, on the other hand, is said to be a substantial upgrade from Sonnet 3.7, balancing high performance with efficiency. It is claimed to achieve a 72.7% score on SWE-bench and is designed for both internal use and third-party deployment. The model has already been adopted by GitHub as the core of its new Copilot coding agent.

Both models are accessible via the Claude Pro, Max, Team, and Enterprise plans, with Sonnet 4 also available to free-tier users. Pricing remains unchanged from the previous generation. The Claude 4 launch is available now across the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.

Anthropic notes that both models are 65% less likely to take shortcuts in agentic tasks compared to Sonnet 3.7. It has also introduced 'thinking summaries' to condense longer reasoning chains for easier interpretation, while offering a Developer Mode for advanced users who need full process transparency.

Anthropic has rolled out four new capabilities on its API: a code execution tool, a Model Context Protocol (MCP) connector, a Files API, and a prompt caching function. These tools aim to help developers build more robust AI-driven applications and agents.
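Prompt caching, one of the four new API capabilities listed above, lets developers mark a large, reusable prompt prefix so repeated requests do not have to re-process it. The sketch below assumes the anthropic Python SDK; the model name and file path are illustrative, and the one-hour cache window mentioned in the coverage may require additional configuration not shown here.

```python
# A minimal sketch of prompt caching with the `anthropic` Python SDK: a large,
# reusable system prompt is marked with cache_control so later requests can
# reuse the cached prefix. Model name and file path are illustrative.
import anthropic

client = anthropic.Anthropic()

reference_doc = open("style_guide.md").read()  # hypothetical reusable context

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model identifier
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": reference_doc,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "Review this patch against the style guide."}],
)
print(response.content[0].text)
```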


India Today
23-05-2025
- Business
- India Today
Anthropic announces Claude 4 Sonnet and Claude 4 Opus AI models, says they reason hard
Anthropic has launched two new AI models: Claude Opus 4 and Claude Sonnet 4. Anthropic says that these models are the best in the industry, and their highlight is their ability to reason. Along with that, Opus 4 and Sonnet 4 are both designed to improve coding and agent-like tasks.

According to Anthropic, Claude Opus 4 is its most powerful model to date and is aimed at developers working on long and complex tasks. 'Claude Opus 4 is the world's best coding model, with sustained performance on complex, long-running tasks and agent workflows,' Anthropic writes on its blog. Sonnet 4, meanwhile, is described as a more practical, efficient upgrade from its predecessor, Claude Sonnet 3.7, and is now available to free-tier users. Opus 4 will only be available to paid subscribers.

One of the key highlights, according to Anthropic, is Claude Opus 4's strong performance on coding benchmarks. The AI company claims that it scored 72.5 per cent on SWE-bench and 43.2 per cent on Terminal-bench, which basically means the model can reportedly work for several hours at a time without dropping performance, making it suitable for projects that require sustained attention. Claude Sonnet 4 also shows improvement over previous versions, scoring 72.7 per cent on SWE-bench. While it doesn't match Opus 4 in overall capability, Anthropic says it strikes a better balance between speed and accuracy, which makes it suitable for broader, more everyday use.

Both models come with what Anthropic calls 'extended thinking' and tool use. This means they can pause reasoning, use tools like web search or code execution, and then resume their thought process. Tool use can now happen in parallel as well, which helps with more complex workflows.

The models also introduce new memory features. If given access to local files, they can extract key facts and save them for future use. Anthropic says this helps the model build better long-term memory and improve performance on tasks that require continuity.

Anthropic has also announced four new API features, which are rolling out now. These include a code execution tool, a connector for the Model Context Protocol (MCP), a Files API, and prompt caching for up to one hour. These updates aim to make it easier for developers to build AI agents that can take on more complex tasks.

Anthropic claims that Claude Opus 4 shows strong memory performance, particularly in agent-like settings. Anthropic says it gave the AI model file access during a game of Pokémon, and it was able to create a 'Navigation Guide' and maintain awareness of past actions. It also reduces the chance of using shortcuts or loopholes to complete tasks, something previous models were more prone to.

To help users better understand how the models arrive at conclusions, Anthropic has added a new feature called 'thinking summaries.' These are short overviews of the model's reasoning process, generated by a smaller AI model. Full chains of thought are still available upon request through a Developer Mode.
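Anthropic has not published the internals of the memory-file behaviour described above, but the general pattern can be approximated with the Messages API's tool-use interface. The sketch below assumes the anthropic Python SDK; the save_memory tool, its schema, and the memory.json store are hypothetical illustrations, not Anthropic's actual mechanism.

```python
# A minimal sketch of approximating the "memory file" behaviour with ordinary
# tool use, assuming the `anthropic` Python SDK. The save_memory tool, its
# schema, and the memory.json store are hypothetical illustrations.
import json
import os
import anthropic

client = anthropic.Anthropic()
MEMORY_PATH = "memory.json"  # hypothetical local store

memory_tool = {
    "name": "save_memory",  # hypothetical tool name
    "description": "Persist a key fact so it can be recalled in later sessions.",
    "input_schema": {
        "type": "object",
        "properties": {"fact": {"type": "string"}},
        "required": ["fact"],
    },
}

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model identifier
    max_tokens=1024,
    tools=[memory_tool],
    messages=[{"role": "user", "content": "Remember that the staging database migrates on Fridays."}],
)

# If the model chose to call the tool, append the extracted fact to the store.
for block in response.content:
    if block.type == "tool_use" and block.name == "save_memory":
        facts = json.load(open(MEMORY_PATH)) if os.path.exists(MEMORY_PATH) else []
        facts.append(block.input["fact"])
        with open(MEMORY_PATH, "w") as f:
            json.dump(facts, f)
```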


Forbes
15-04-2025
- Business
- Forbes
OpenAI Shifts Focus With GPT-4.1, Prioritizes Coding And Cost Efficiency
OpenAI launched its GPT-4.1 family of AI models, focusing on enhancing developer productivity through improved coding, long-context handling, and instruction-following capabilities, available directly via its application programming interface. The release includes three distinct models, GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, signaling a move toward task-specific optimizations within the large language model landscape. These models are not immediately replacing user-facing interfaces like ChatGPT but are positioned as tools for developers building applications and services.

For technology leaders and business decision makers, this release warrants attention. It indicates a strategic direction toward more specialized and potentially more cost-effective large language models optimized for enterprise functions, particularly software development, complex data analysis, and the creation of autonomous AI agents. The availability of tiered models and improved performance metrics could influence decisions around AI integration, build-versus-buy strategies, and the allocation of resources for internal development tools, potentially altering established development cycles.

Technically, the GPT-4.1 series represents an incremental but focused upgrade over its predecessor, GPT-4o. A significant enhancement is the expansion of the context window to support up to 1 million tokens. This is a substantial increase from the 128,000-token capacity of GPT-4o, allowing the models to process and maintain coherence across much larger volumes of information, equivalent to roughly 750,000 words. This capability directly addresses use cases involving the analysis of extensive codebases, the summarization of lengthy documents, or maintaining context in prolonged, complex interactions necessary for sophisticated AI agents. The models operate with refreshed knowledge, incorporating information up to June 2024.

OpenAI reports improvements in core competencies relevant to developers. Internal benchmarks suggest GPT-4.1 shows a measurable improvement in coding tasks compared to both GPT-4o and the earlier GPT-4.5 preview model. Performance on benchmarks like SWE-bench, which measures the ability to resolve real-world software engineering issues, showed GPT-4.1 achieving a 55% success rate, according to OpenAI. The models are also trained to follow instructions more literally, which requires careful and specific prompting but allows for greater control over the output. The tiered structure offers flexibility: the standard GPT-4.1 provides the highest capability, while the mini and nano versions offer balances between performance, speed, and reduced operational cost, with nano positioned as the fastest and lowest-cost option, suitable for tasks like classification or autocompletion.

In the broader market context, the GPT-4.1 release intensifies competition among leading AI labs. Providers like Google with its Gemini series and Anthropic with its Claude models have also introduced models boasting million-token context windows and strong coding capabilities. This reflects an industry trend moving beyond general-purpose models toward variants optimized for specific high-value tasks, often driven by enterprise demand. OpenAI's partnership with Microsoft is evident, with GPT-4.1 models being made available through Microsoft Azure OpenAI Service and integrated into developer tools like GitHub Copilot and GitHub Models.
Concurrently, OpenAI announced plans to retire API access to its GPT-4.5 preview model by mid-July 2025, positioning the new 4.1 series as offering comparable or better performance at a lower cost. OpenAI's GPT-4.1 series introduces a significant reduction in API pricing compared to its predecessor, GPT-4o, making advanced AI capabilities more accessible to developers and enterprises. This pricing strategy positions GPT-4.1 as a more cost-effective solution, offering up to 80% savings per query compared to GPT-4o, while also delivering enhanced performance and faster response times. The tiered model approach allows developers to select the appropriate balance between performance and cost, with GPT-4.1 Nano being ideal for tasks like classification or autocompletion, and the standard GPT-4.1 model suited for more complex applications.

From a strategic perspective, the GPT-4.1 family presents several implications for businesses. The improved coding and long-context capabilities could accelerate software development cycles, enabling developers to tackle more complex problems, analyze legacy code more effectively, or generate code documentation and tests more efficiently. The potential for building more sophisticated internal AI agents capable of handling multi-step tasks with access to large internal knowledge bases increases. Cost efficiency is another factor; OpenAI claims the 4.1 series operates at a lower cost than GPT-4.5 and has increased prompt caching discounts for users processing repetitive context. Furthermore, the upcoming availability of fine-tuning for the 4.1 and 4.1-mini models on platforms like Azure will allow organizations to customize these models using their own data for specific domain terminology, workflows, or brand voice, potentially offering a competitive advantage.

However, potential adopters should consider certain factors. The enhanced literalness in instruction-following means prompt engineering becomes even more critical, requiring clarity and precision to achieve desired outcomes. While the million-token context window is impressive, OpenAI's data suggests that model accuracy can decrease when processing information at the extreme end of that scale, indicating a need for testing and validation for specific long-context use cases. Integrating and managing these API-based models effectively within existing enterprise architectures and security frameworks also requires careful planning and technical expertise. This release from OpenAI underscores the rapid iteration cycles in the AI space, demanding continuous evaluation of model capabilities, cost structures, and alignment with business objectives.
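The tiered structure described above lends itself to simple routing between the three GPT-4.1 variants by task and prompt size. The sketch below assumes the openai Python SDK; only the model names come from the article, while the routing heuristic and token thresholds are illustrative assumptions.

```python
# A minimal sketch of choosing between the GPT-4.1 tiers and calling the OpenAI
# API, assuming the `openai` Python SDK. Only the model names come from the
# article; the routing heuristic and token thresholds are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pick_model(task: str, prompt_tokens: int) -> str:
    """Crude tier selection: cheapest model for light tasks, full model otherwise."""
    if task in {"classification", "autocomplete"}:
        return "gpt-4.1-nano"
    if prompt_tokens < 8_000:
        return "gpt-4.1-mini"
    return "gpt-4.1"

prompt = "Summarize the failure modes in this legacy module: ..."
model = pick_model("code-analysis", prompt_tokens=50_000)

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```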
Yahoo
12-04-2025
- Business
- Yahoo
AI Still Struggles to Debug Code, But for How Long?
If you're a programmer who is scared about AI taking your job, Microsoft's R&D division might have some promising news for you. Microsoft Research tested several top large language models (LLMs) and found that many come up short on common programming tasks.

The study tested nine different models — including Anthropic's Claude 3.7 Sonnet, OpenAI's o1, and OpenAI's o3-mini — and assessed their ability to perform 'debugging,' the time-consuming process whereby programmers sift through existing code to find flaws that prevent it from working as intended. Microsoft hooked up the AIs to a debugging assistant it created, called Debug Gym, and tested them on a common software benchmark, SWE-bench.

The study had mixed results, and none of the tools achieved even a 50% success rate, even with the help of Debug Gym. Anthropic's Claude 3.7 Sonnet was the best performer, managing to successfully debug the faulty code in 48.4% of cases. OpenAI's o1 achieved success 30.2% of the time, while OpenAI's o3-mini did so 22.1% of the time. Microsoft says it believes the AI tools can become effective code debuggers, but it needs "to fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs."

The findings may provide some slight relief for worried programmers, as more of the tech world's largest names pivot toward using AI for coding. In October, Google announced it was using AI to write "a quarter of all new code." Meanwhile, AI startup Cognition Labs rolled out a new AI tool last year, dubbed Devin AI, that it claims can write code without human interference, complete engineering jobs on Upwork, and adjust its own AI models. Meta CEO Mark Zuckerberg, meanwhile, told podcaster Joe Rogan that his company will "have an AI that can effectively be a sort of mid-level engineer that you have at your company that can write code" at some point in 2025, and he expects other companies will do the same.