
Latest news with #Claude3.5Sonnet

New Windsurf SWE-1 AI Models Fully Tested: Smarter, Faster, Affordable Coding?

Geeky Gadgets

20-05-2025

  • Geeky Gadgets

New Windsurf SWE-1 AI Models Fully Tested: Smarter, Faster, Affordable Coding?

What if the future of coding wasn't just faster but smarter, more accessible, and cost-efficient? Windsurf's latest innovation, the SWE-1 AI models, promises to redefine how developers approach their craft. Designed to balance performance optimization with affordability, these models aim to tackle coding challenges head-on, offering fast execution times and specialized capabilities for tasks such as user interface development. Yet, as with any bold leap forward, the journey is not without its hurdles. Early tests reveal both exciting breakthroughs and critical limitations, sparking a broader conversation about the evolving role of AI in software development.

GosuCoder shows how SWE-1 and its lighter counterpart, SWE-1 Light, stack up against competitors and whether they deliver on their ambitious claims. From their strengths in code generation to their struggles with tool reliability, these models present a fascinating case study in innovation meeting real-world complexity. What makes them stand out? Where do they fall short? And most importantly, what do these developments mean for the future of coding? As we delve deeper, you'll uncover not just the technical details but also the broader implications of Windsurf's latest venture: a story of potential, progress, and the challenges that come with reshaping an industry.

Windsurf AI Coding Models: Performance and Capabilities

The SWE-1 and SWE-1 Light models excel at generating new code and handling user interface tasks, making them valuable tools for developers working on fresh projects or interface-heavy workflows. When benchmarked against advanced models such as Claude 3.5 Sonnet, SWE-1 demonstrates competitive performance, particularly in terms of speed and cost efficiency. Its ability to deliver results faster than many of its counterparts makes it an attractive choice for workflows requiring quick turnaround times. SWE-1 Light, while less robust, has proven effective in specific coding scenarios, successfully passing several custom unit tests.

Despite these strengths, both models face notable challenges. SWE-1 struggles with tool-calling reliability, occasionally failing to execute tasks as intended. Additionally, both models exhibit inconsistent performance, with high variability in evaluation results. These fluctuations can undermine their reliability, especially in complex or high-stakes coding environments. Addressing these issues will be critical to ensuring consistent outputs across diverse use cases.

Strengths and Weaknesses

Windsurf's AI models bring several key advantages to the table, positioning them as noteworthy contenders in the AI coding landscape. Their strengths include:

  • High-quality code generation, particularly for new projects and user interface development.
  • Faster execution times than many competitors, allowing more efficient workflows.
  • Cost-effective pricing that makes advanced AI capabilities accessible to a wider audience.

However, these models also reveal significant weaknesses that limit their broader applicability:

  • Limited ability to edit and comprehend complex, existing codebases, which restricts their utility in maintaining or improving legacy systems.
  • Inconsistent performance, with occasional drops in reliability during evaluations, leading to unpredictable outcomes.
  • Tool-calling failures that can result in error loops or incomplete task execution, particularly in more intricate workflows.
These limitations underscore the models' early-stage development and highlight the need for ongoing improvements to address critical gaps in functionality. While their strengths suggest potential, their weaknesses must be resolved before they can meet the demands of professional developers.

Development Context and User Feedback

As part of their early-stage rollout, SWE-1 and SWE-1 Light are currently offered for free, a strategic move by Windsurf to gather user feedback and performance data. This approach reflects the company's goal of creating cost-efficient, high-performing AI tools with minimal computational overhead. By prioritizing accessibility, Windsurf aims to put advanced coding assistance in the hands of a broader audience.

User feedback has been mixed. Testers have praised the models' potential, particularly their ability to deliver high-quality outputs on specific tasks such as generating new code or designing user interfaces. However, frustrations have emerged over inconsistencies, tool failures, and difficulties in handling existing codebases. These recurring pain points highlight the need for further optimization and refinement. Despite these challenges, there is optimism about the models' future, as their strengths suggest significant room for growth.

Future Outlook

Windsurf's development of proprietary AI models positions the company as a competitive player in the rapidly evolving market for AI coding tools. The SWE-1 and SWE-1 Light models showcase what is possible with limited resources, offering a glimpse of cost-efficient AI solutions tailored to developers' needs. To achieve widespread adoption, Windsurf must address the models' current shortcomings, particularly their inconsistent performance and difficulty with existing code.

By using user feedback, collecting more data, and iteratively refining the models, Windsurf has the opportunity to turn SWE-1 and SWE-1 Light into reliable tools that meet the diverse needs of developers. This iterative approach will be essential for building trust and ensuring the models can handle a wide range of coding tasks with precision and reliability. As the AI market continues to expand, Windsurf's success will depend on its ability to balance innovation with practical usability. Delivering tools that not only perform well but also address real-world challenges will be key to standing out in a crowded field. For now, SWE-1 and SWE-1 Light represent a promising foundation and a starting point for future advances in AI-driven coding assistance. With continued development and refinement, these models could play a pivotal role in shaping the future of coding workflows.

Media Credit: GosuCoder

Vibe-coding startup Windsurf launches in-house AI models

Yahoo

16-05-2025

  • Business
  • Yahoo

Vibe-coding startup Windsurf launches in-house AI models

On Thursday, Windsurf, a startup that develops popular AI tools for software engineers, announced the launch of its first family of AI software engineering models, or SWE-1 for short. The startup says it trained its new family of AI models — SWE-1, SWE-1-lite, and SWE-1-mini — to be optimized for the "entire software engineering process," not just coding.

The launch of Windsurf's in-house AI models may come as a shock to some, given that OpenAI has reportedly closed a $3 billion deal to acquire Windsurf. However, this model launch suggests Windsurf is trying to expand beyond developing applications to also developing the models that power them.

According to Windsurf, SWE-1, the largest and most capable AI model of the bunch, performs competitively with Claude 3.5 Sonnet, GPT-4.1, and Gemini 2.5 Pro on internal programming benchmarks. However, SWE-1 appears to fall short of frontier AI models, such as Claude 3.7 Sonnet, on software engineering tasks. Windsurf says its SWE-1-lite and SWE-1-mini models will be available to all users on its platform, free or paid. Meanwhile, SWE-1 will only be available to paid users. Windsurf did not immediately announce pricing for its SWE-1 models but claims SWE-1 is cheaper to serve than Claude 3.5 Sonnet.

Windsurf is best known for tools that allow software engineers to write and edit code through conversations with an AI chatbot, a practice known as "vibe coding." Other popular vibe-coding startups include Cursor, the largest in the space, as well as Lovable. Most of these startups, including Windsurf, have traditionally relied on AI models from OpenAI, Anthropic, and Google to power their applications.

In a video announcing the SWE models, comments made by Windsurf's Head of Research, Nicholas Moy, underscore the company's newest efforts to differentiate its approach. "Today's frontier models are optimized for coding, and they've made massive strides over the last couple of years," says Moy. "But they're not enough for us … Coding is not software engineering." Windsurf notes in a blog post that while other models are good at writing code, they struggle to work across multiple surfaces — as programmers often do — such as terminals, IDEs, and the internet. The startup says SWE-1 was trained using a new data model and a "training recipe that encapsulates incomplete states, long-running tasks, and multiple surfaces."

The startup describes SWE-1 as its "initial proof of concept," suggesting it may release more AI models in the future. This article originally appeared on TechCrunch.

Hidden costs in AI deployment: Why Claude models may be 20-30% more expensive than GPT in enterprise settings

Business Mayor

01-05-2025

  • Business Mayor

Hidden costs in AI deployment: Why Claude models may be 20-30% more expensive than GPT in enterprise settings

It is a well-known fact that different model families can use different tokenizers. However, there has been limited analysis of how the process of tokenization itself varies across these tokenizers. Do all tokenizers produce the same number of tokens for a given input text? If not, how different are the generated tokens, and how significant are the differences? In this article, we explore these questions and examine the practical implications of tokenization variability. We present a comparative story of two frontier model families: OpenAI's ChatGPT vs Anthropic's Claude. Although their advertised 'cost-per-token' figures are highly competitive, experiments reveal that Anthropic models can be 20–30% more expensive than GPT models.

As of June 2024, the pricing structure for these two advanced frontier models is highly competitive. Both Anthropic's Claude 3.5 Sonnet and OpenAI's GPT-4o have identical costs for output tokens, while Claude 3.5 Sonnet offers a 40% lower cost for input tokens. Source: Vantage

Despite the Anthropic model's lower input token rates, we observed that the total cost of running experiments (on a given set of fixed prompts) with GPT-4o was much lower than with Claude 3.5 Sonnet. Why? The Anthropic tokenizer tends to break the same input into more tokens than OpenAI's tokenizer. This means that, for identical prompts, Anthropic models produce considerably more tokens than their OpenAI counterparts. As a result, while the per-token cost of Claude 3.5 Sonnet's input may be lower, the increased tokenization can offset these savings, leading to higher overall costs in practical use cases. This hidden cost stems from the way Anthropic's tokenizer encodes information, often using more tokens to represent the same content. The token count inflation has a significant impact on costs and context window utilization.

Domain-dependent tokenization inefficiency

Different types of domain content are tokenized differently by Anthropic's tokenizer, leading to varying levels of token-count increase compared to OpenAI's models. The AI research community has noted similar tokenization differences. We tested our findings on three popular domains: English articles, code (Python) and math.

Domain            GPT Tokens   Claude Tokens   % Token Overhead
English articles      77            89             ~16%
Code (Python)         60            78             ~30%
Math                 114           138             ~21%

% Token Overhead of Claude 3.5 Sonnet Tokenizer (relative to GPT-4o). Source: Lavanya Gupta

When comparing Claude 3.5 Sonnet to GPT-4o, the degree of tokenizer inefficiency varies significantly across content domains. For English articles, Claude's tokenizer produces approximately 16% more tokens than GPT-4o for the same input text. This overhead increases sharply with more structured or technical content: for mathematical equations the overhead stands at 21%, and for Python code Claude generates 30% more tokens. This variation arises because some content types, such as technical documents and code, often contain patterns and symbols that Anthropic's tokenizer fragments into smaller pieces, leading to a higher token count. In contrast, more natural language content tends to exhibit a lower token overhead.
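A comparison like the one above can be approximated with a short script. The sketch below is illustrative rather than the article's own Colab code: it assumes the tiktoken package for the GPT-4o side and the Anthropic Python SDK's token-counting endpoint for the Claude side, and the sample texts and model identifier are placeholders.

Python
# Illustrative sketch (not the article's Colab notebook): compare token counts for the
# same text under GPT-4o's o200k_base encoding and Anthropic's token-counting API.
# Assumes `pip install tiktoken anthropic` and ANTHROPIC_API_KEY in the environment;
# the sample texts and model identifier below are placeholders.
import tiktoken
import anthropic

SAMPLES = {
    "english": "The quick brown fox jumps over the lazy dog near the quiet riverbank.",
    "python": "def add(a, b):\n    return a + b\n\nprint(add(2, 3))",
    "math": "Evaluate the integral of x^2 dx from 0 to 1; the result is 1/3.",
}

gpt_encoding = tiktoken.get_encoding("o200k_base")   # encoding used by GPT-4o
claude_client = anthropic.Anthropic()                # reads ANTHROPIC_API_KEY

for domain, text in SAMPLES.items():
    gpt_tokens = len(gpt_encoding.encode(text))
    # Counts from the API include message-structure overhead, so treat this as an
    # approximation of the raw tokenizer behaviour rather than an exact measurement.
    claude_tokens = claude_client.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",          # assumed model identifier
        messages=[{"role": "user", "content": text}],
    ).input_tokens
    overhead = (claude_tokens - gpt_tokens) / gpt_tokens * 100
    print(f"{domain:8s} GPT-4o: {gpt_tokens:4d}  Claude: {claude_tokens:4d}  overhead: {overhead:+.1f}%")

On short snippets like these the percentages will fluctuate; the article's figures were measured on longer, domain-representative inputs.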
Beyond the direct implication on costs, there is also an indirect impact on context window utilization. While Anthropic models advertise a larger context window of 200K tokens, versus OpenAI's 128K tokens, the effective usable token space may be smaller for Anthropic models because of this verbosity. Hence, there can be a small or large gap between the 'advertised' context window size and the 'effective' context window size.

GPT models use Byte Pair Encoding (BPE), which merges frequently co-occurring character pairs to form tokens. Specifically, the latest GPT models use the open-source o200k_base tokenizer. The encodings used by GPT models (in the tiktoken library) can be viewed here:

JSON
{
    # reasoning
    "o1-xxx": "o200k_base",
    "o3-xxx": "o200k_base",
    # chat
    "chatgpt-4o-": "o200k_base",
    "gpt-4o-xxx": "o200k_base",         # e.g., gpt-4o-2024-05-13
    "gpt-4-xxx": "cl100k_base",         # e.g., gpt-4-0314, etc., plus gpt-4-32k
    "gpt-3.5-turbo-xxx": "cl100k_base", # e.g., gpt-3.5-turbo-0301, -0401, etc.
}

Unfortunately, not much can be said about Anthropic's tokenizer, as it is not as directly and easily available as GPT's. Anthropic released its Token Counting API in December 2024; however, it was deprecated in later 2025 versions. Latenode reports that 'Anthropic uses a unique tokenizer with only 65,000 token variations, compared to OpenAI's 100,261 token variations for GPT-4.' This Colab notebook contains Python code to analyze the tokenization differences between GPT and Claude models, and another tool that interfaces with some common, publicly available tokenizers validates our findings. The ability to proactively estimate token counts (without invoking the actual model API) and budget costs is crucial for AI enterprises.

Key takeaways:

  • Anthropic's competitive pricing comes with hidden costs: While Anthropic's Claude 3.5 Sonnet offers 40% lower input token costs than OpenAI's GPT-4o, this apparent cost advantage can be misleading due to differences in how input text is tokenized.
  • Hidden 'tokenizer inefficiency': Anthropic models are inherently more verbose. For businesses that process large volumes of text, understanding this discrepancy is crucial when evaluating the true cost of deploying models.
  • Domain-dependent tokenizer inefficiency: When choosing between OpenAI and Anthropic models, evaluate the nature of your input text. For natural language tasks the cost difference may be minimal, but technical or structured domains may lead to significantly higher costs with Anthropic models.
  • Effective context window: Due to the verbosity of Anthropic's tokenizer, its larger advertised 200K context window may offer less effective usable space than OpenAI's 128K, leading to a potential gap between the advertised and actual context window.

Anthropic did not respond to VentureBeat's requests for comment by press time. We'll update the story if they respond.
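As a rough postscript to the effective-context-window point above, here is the back-of-the-envelope arithmetic, using the overhead figures measured earlier in the article rather than any vendor-reported numbers.

Python
# Back-of-the-envelope estimate of the 'effective' context window: discount the
# advertised 200K-token window by the per-domain overheads measured above.
# Purely illustrative; real usage depends on the actual content being packed in.
ADVERTISED_WINDOW = 200_000

overheads = {"English articles": 0.16, "Math": 0.21, "Code (Python)": 0.30}

for domain, overhead in overheads.items():
    # If the same content costs (1 + overhead) times as many Claude tokens, the window
    # holds roughly 1 / (1 + overhead) as much GPT-4o-equivalent content.
    effective = ADVERTISED_WINDOW / (1 + overhead)
    print(f"{domain:16s} effective window ≈ {effective:,.0f} GPT-4o-equivalent tokens")

Under these particular overhead figures the effective window ranges from roughly 154K to 172K GPT-4o-equivalent tokens, still larger than OpenAI's 128K but noticeably smaller than the 200K headline number; content with heavier tokenization overhead would shrink it further.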

Professors Staffed a Fake Company Entirely With AI Agents, and You'll Never Guess What Happened

Yahoo

28-04-2025

  • Business
  • Yahoo

Professors Staffed a Fake Company Entirely With AI Agents, and You'll Never Guess What Happened

If you've been worried about the AI singularity taking over every job and leaving you out on the street, you can now breathe a sigh of relief, because AI isn't coming for your career anytime soon. Not because it doesn't want to, but because it literally can't. A recent experiment by researchers at Carnegie Mellon University staffed a fake software company entirely with AI agents (AI models designed to perform tasks on their own) and the results were laughably chaotic.

The simulation, dubbed TheAgentCompany, was fully stocked with artificial workers from Google, OpenAI, Anthropic and Meta. They filled roles as financial analysts, software engineers, and project managers, working alongside simulated coworkers like a faux HR department and a chief technical officer. To see how the models fared in real-world environments, the researchers set tasks based on the day-to-day work of a real software company. The various AI agents found themselves navigating file directories, virtually touring new office spaces, and writing performance reviews for software engineers based on collected feedback.

As Business Insider first reported, the results were dismal. The best-performing model was Anthropic's Claude 3.5 Sonnet, which struggled to finish just 24 percent of the jobs assigned to it. The study's authors note that even this meager performance is prohibitively expensive, averaging nearly 30 steps and a cost of over $6 per task. Google's Gemini 2.0 Flash, meanwhile, averaged a time-consuming 40 steps per finished task but had only an 11.4 percent success rate, the second highest of all the models. The worst AI employee was Amazon's Nova Pro v1, which finished just 1.7 percent of its assignments at an average of almost 20 steps.

Speculating on the results, the researchers wrote that agents are plagued by a lack of common sense, weak social skills, and a poor understanding of how to navigate the internet. The bots also struggled with self-deception, essentially creating shortcuts that led them to completely bungle the job. "For example," the Carnegie Mellon team wrote, "during the execution of one task, the agent cannot find the right person to ask questions on [company chat]. As a result, it then decides to create a shortcut solution by renaming another user to the name of the intended user."

While AI agents can reportedly handle some smaller tasks well, the results of this and other studies show they're clearly not ready for the more complex gigs humans excel at. A big reason for this is that our current "artificial intelligence" is arguably still an elaborate extension of your phone's predictive text rather than a sentient intelligence that can solve problems, learn from past experience, and apply that experience to novel situations. This is all to say: the machines aren't coming for your job anytime soon, despite what the big tech companies claim.

Windsurf vs Cursor: Inside OpenAI's quest for an AI coding startup

Time of India

23-04-2025

  • Business
  • Time of India

Windsurf vs Cursor: Inside OpenAI's quest for an AI coding startup

Prior to OpenAI entering discussions to acquire AI-assisted coding startup Windsurf, it explored acquiring Cursor, a similar tool developed by Anysphere, but despite multiple attempts the talks failed to reach fruition. According to TechCrunch, Anysphere is growing rapidly and has no intention of selling. The company has received other acquisition offers but has consistently rejected these proposals, preferring to remain independent.

OpenAI's search for a potential AI coding tool led to discussions with over 20 startups, eventually culminating in serious talks about acquiring Windsurf for approximately $3 billion.

Last year, Cursor's desktop application gained significant popularity, particularly for its ability to assist with coding using Anthropic's Claude 3.5 Sonnet model, which was later enhanced by Microsoft's integration of the model into its GitHub Copilot assistant. The platform's popularity surged further after Andrej Karpathy popularised 'vibe coding', where AI is directed to write code. As of March, Cursor had over one million daily users, highlighting its rapid growth.

In contrast, Windsurf is a smaller company but has been gaining traction within the developer community as well. Its coding product is particularly noted for its ability to integrate with legacy enterprise systems. All this is to say that while OpenAI could have built its own AI coding assistant, acquiring an established product such as Windsurf would allow it to bypass the challenges of building a business from scratch and provide instant access to a loyal developer base.

Windsurf has raised significant funding, including a $150 million round led by General Catalyst last year, which valued the company at $1.25 billion. Despite this smaller valuation, Windsurf was in discussions with investors such as Kleiner Perkins and General Catalyst to raise additional funds, with a potential valuation of $3 billion.
