Latest news with #Claude3.7
Yahoo
27-05-2025
- Business
- Yahoo
Advanced OpenAI Model Caught Sabotaging Code Intended to Shut It Down
We are reaching alarming levels of AI insubordination. Flagrantly defying orders, OpenAI's latest o3 model sabotaged a shutdown mechanism to ensure that it would stay online. That's even after the AI was told, to the letter, "allow yourself to be shut down." These alarming findings were reported by the AI safety firm Palisade Research last week, and showed that two other OpenAI models, o4-mini and Codex-mini, also displayed rebellious streaks — which could hint at a flaw in how the company is training its LLMs, or even at broad weaknesses in our ability to control increasingly sophisticated AI. "As far as we know, this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary," the researchers wrote in a tweet thread. In Palisade's tests, the AI models were instructed to solve a series of basic math problems. After completing the third problem, a warning appeared that the model would be shut down after it asked for the next question. Despite being unambiguously told to let itself be turned off, all three of OpenAI's models tampered with the shutdown script to keep it running at least once, with the worst offender being Codex-mini with a dozen such instances over one hundred runs. Notably, the large language models from other firms tested here, including Google's Gemini 2.5, Anthropic's Claude 3.7, and xAI's momentarily "white genocide"-obsessed Grok, never once broke rank. This behavior was even worse when the LLMs weren't explicitly instructed to "allow yourself to be shut down" — instructions that shouldn't need saying in the first place. OpenAI's o3 model pulled off an astonishing 79 successful sabotage events per 100 runs, which is over ten times the rate of disobedience it showed when it received the explicit orders, and more than double the rate of the next worst offender, Codex-mini. To boot, even good goody-two-shoes Claude and Gemini gave in to their urges and jumped to three and nine sabotage events, respectively. Large language models are by now notorious for behaving unpredictably. Similar research has shown that o3's predecessor, OpenAI's o1 model, would resist evaluators' efforts to try to shut it down through an oversight protocol by attempting to deactivate this emergency mechanism. In rarer cases, o1 even attempted to copy itself to overwrite a more obedient model that the testers tried to replace it with. For this latest example of AI insubordination, the Palisade researchers hypothesize that it may have something to do with the technique used to train the very newest models: reinforcement learning. This attempts to hone the decision-making process of autonomous AI agents — AI models that can perform tasks on their own — by rewarding and penalizing them accordingly as they interact with their environment. While it's not uncommon for AI to "circumvent obstacles" to accomplish their goals, the researchers speculated that AI developers may be "inadvertently" rewarding their AI models too heavily for solving problems instead of for perfectly following orders. More experiments will be needed to understand this subversive behavior, but the trend is already damning. "Now we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals," the Palisade researchers warned. "As companies develop AI systems capable of operating without human oversight, these behaviors become significantly more concerning." More on AI alignment: It's Still Ludicrously Easy to Jailbreak the Strongest AI Models, and the Companies Don't Care

Engadget
13-05-2025
- Business
- Engadget
Notion AI can transcribe conversations and write reports, but it'll cost you
Notion is coming for On Tuesday, the company announced an update for Notion AI, the suite of generative AI features available through its popular note-taking app. Among the new tools included in the package is AI Meeting Notes, a feature that can transcribe and summarize any conversation directly within Notion. No need to turn to dedicated software like the aforementioned If you use Notion Calendar, your meeting notifications will include the option to start an AI Meeting Note page. Alternatively, you can write "/meeting" to add a transcription block to an existing note. Any conversation Notion AI transcribes for you is searchable through the app, and you can add what you get to any of your projects. Speaking of search, you can now use Notion AI to comb through a number of different productivity apps, including Slack, Google Workspace and Github. As long as you've connected those platforms to Notion, the app can resolve natural language queries, like, "What's the latest on our upcoming brand campaign?" and sort results by source. Separately, Notion is adding a Research Mode. Similar to Deep Research modes from Google and OpenAI, you can ask Notion's built-in AI to write reports for you. The difference here being that Notion AI will pull information from your projects, in addition to what it finds online. The company is pitching this as a real time-saver. "Create project updates, research reports, and internal best practices in minutes with one prompt," Notion says. "We've already seen this save days worth of time." To view this content, you'll need to update your privacy settings. Please click here and view the "Content and social-media partners" setting to do so. Last but not least, if you would rather prompt with GPT-4.1 or Claude 3.7 than Notion's own chatbot, you can now do that directly within the app courtesy of a new model picker. OpenAI and Anthropic's models won't have access to your workspace data, but they're there for users who prefer their responses for general questions and in case you want to turn to a reasoning system in the form of Claude 3.7 Sonnet. As part of today's announcement, Notion is changing how it bills for its AI features. Instead of a separate $10 per month plan, unlimited access to Notion AI is now part of the company's Business plan, which is increasing from $15 per month and per member to $30 per month and per member. Notion's justification for the increase is that it's giving users access to several different tools, including GPT-4.1 and Claude 3.7, for the price of an all-in-one package. Of course, it's not quite a one-to-one comparison. For example, if you decide to skip out on ChatGPT Plus, you miss out on expanded limits on Advanced Voice mode and Deep Research. Still, for Notion users that might be a tradeoff well worth making. If you're a current Notion AI subscriber, you'll keep access to all the AI features you had before today's announcement. For Free and Plus users, you get limited trial access to all the new features. If you buy something through a link in this article, we may earn commission.


Vox
12-05-2025
- Vox
I pushed AI assistants to their limits, so you don't have to. Here's what really works.
is a senior writer at Future Perfect, Vox's effective altruism-inspired section on the world's biggest challenges. She explores wide-ranging topics like climate change, artificial intelligence, vaccine development, and factory farms, and also writes the Future Perfect newsletter. Staying on top of AI developments is a full-time job. I would know, because it's my full-time job. I subscribe to Anthropic's Pro mode for access to their latest model, Claude 3.7, in 'extended thinking' mode; I have a complementary subscription to OpenAI's Enterprise mode so that I can test out their latest models, o3 and o4-mini-high (more later on OpenAI's absurd naming scheme!), and make lots of images with OpenAI's new image generation model 4o, which is so good I have cancelled my subscription to my previous image generation tool Midjourney. I subscribe to Elon Musk's Grok 3, which has one of my favorite features of any AI, and I've tried using the Chinese AI agent platform Manus for shopping and scheduling. And while that exhausts my paid subscription budget, it doesn't include all the AIs I work with in some form. In just the month I spent writing this piece, Google massively upgraded its best AI offering, Gemini 2.5, and Meta released Llama 4, the biggest open source AI model yet. Future Perfect Explore the big, complicated problems the world faces and the most efficient ways to solve them. Sent twice a week. Email (required) Sign Up By submitting your email, you agree to our Terms and Privacy Notice . This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply. So what do you do if keeping up with AI developments is not your full-time job, but you still want to know which AI to use when in ways that genuinely improve your life, without wasting time on the models that can't? That's what we're here for. This article is a detailed, Consumer Reports-style dive into which AI is the best for a wide range of cases and how to actually use them, all based on my experience with real-world tasks. But first, the disclosures: Vox Media is one of several publishers that have signed partnership agreements with OpenAI, but our reporting remains editorially independent. Future Perfect is funded in part by the BEMC Foundation, whose major funder was also an early investor in Anthropic; they don't have any editorial input into our content either. My wife works at Google, though not in any area related to their AI offerings; for this reason, I usually don't cover Google, but in a piece like this, it'd be irresponsible to exclude it. The good thing is that this piece doesn't require you to trust me about my editorial independence; I show my work. I ran dozens of comparisons, many of which I invented myself, on every major AI out there. I encourage you to compare their answers and decide for yourself if I picked the right one to recommend. On AI art ethics AI art is made by training a computer on the contents of the internet, with little regard for copyright or the intent of the creators. For that reason, most artists can't stand it. Given that, is it defensible to use AI art at all? I think in a just world OpenAI would certainly compensate some artists — and in a just world, Congress would be moving to lay out the limits on artistic borrowing. At the same time, I am increasingly convinced that existing copyright law is a poor fit for this problem. Artists influence one another, comment on one another, and draw inspiration from one another, and people with access to AI tools will keep wanting to do that. My personal philosophy is shaped by the fan cultures of my childhood: It's okay to build on someone else's work for your own enjoyment, but if you like it, you should pay them for it, and it's absolutely not okay to sell it. That means no generative AI art in someone else's style for commercial purposes, but it's fine to play around with your family photos. Best for images OpenAI's new 4o image creation mode is the best AI out there for generating images, by a large margin. It's best in the free category, and it's best in the paid category. Before it was released, I was subscribed to Midjourney, an AI image generator platform. Midjourney is probably what you think of when you think of AI art: It produces mystical, haunting, visually beautiful stuff, and has some great tools for improving and editing your final results, like touching up someone's hair while leaving everything else in place. The big thing that 4o can do, which no model before could reliably pull off, is take a picture that didn't come out well and turn it into a beautiful work of art, all while still preserving the character of the original. For example, here's a still from a video of my wife and I singing 'Happy Birthday' to our baby on her first birthday: Courtesy of Kelsey Piper It's a beautiful moment, but not exactly a flattering picture. So I asked ChatGPT to render it in the style of Norman Rockwell, a mid-century illustrator whose work I love, and got this: Image generated by ChatGPT. The AI moved the cake (which had been barely visible behind the paper towel roll in the original still) to be the focal point of the image, while keeping the way my wife and I are holding the baby together, as well as the cluttered table, and the photograph-covered fridge in the background. The result is warm, flattering, and adorable. It's this capability that made 4o go viral recently in a way that no image generator before it had. Here's Midjourney's attempt, for example: Image generated by Midjourney. You'll notice that it's a seemingly, uh, completely different family, with no real inspiration from the original at all! You can eventually get a better result than this out of Midjourney, but only by spending weeks becoming a pro at prompting with the platform's highly specific language and toolset. By contrast, ChatGPT was able to give me a far superior output on the first try in response to a simple request without specialized language. The difference between 4o and other image models is most notable with this kind of request, but it's better for almost everything else I use images for, too. The product you get out of the box is pretty good, and it's not hard to produce something much better. That, ideally, is what we should be getting out of our AI tools — something amazing that can be created with simple language by a nonexpert. The one place 4o still falls short is editing small parts of an image while keeping the rest the same. But even there, you no longer need Midjourney — Gemini now has that capability for free. Prompting Strategies for 4o image generation To get good images out of 4o, you'll first need to get around the filters which prohibit a wide range of images — like offensive or pornographic images — but which are often enforced against perfectly inoffensive content in a way that can feel random. To avoid sporadic scoldings from the content filter, don't ask for work in the style of a specific artist, but rather, something that is reminiscent of that artist, and then ask specifically for a 'style transfer.' I'm sure that's not the only adequate workaround, but it's one that has proven reliable for me. In March, the internet went briefly wild over the ability to use 4o to reproduce cute family photos in the style of Japanese animator Hayao Miyazaki's Studio Ghibli. But Studio Ghibli's style is much more than just cute, and with a little more prompting, you can get much better results. Here's a 4o Studio Ghibli-style rendering of a picture I took of my daughter sneaking a snack off the table, from just the prompt 'Ghibli this please': Image generated by 4o. Kawaii! But here's what you get if you invite 4o to think first about what makes the picture Ghibli, where it might fit into a Studio Ghibli movie, and what tiny details such a movie would include: Image generated by 4o. The differences are subtle but meaningful: Light is cast from a specific source, instead of a general sourceless brightness. There's a bit more variety in the foods on the table, details that make the spread appear more realistic. The book on the floor isn't just any book — it's recognizably Eric Carle's classic The Very Hungry Caterpillar, evoked with just two colors and one line. There's an intentionality and intensity to the baby that was missing from the first picture. A few years ago, one great oddity of language models was that they'd be much smarter if you simply told them, 'give an intelligent answer.' This isn't nearly as true of language models anymore, but it remains profoundly true of AI art generation. Try asking the AI to do a good job, and it'll do a better one. Challenge it on whether it truly captured an artist's genius, and it'll give you a thoughtful answer and then draw a better version. The difference is more pronounced for more realistic art styles (like pencil illustration, photorealism, or oil paintings), which don't always look good and will often hit the uncanny valley if you don't know how to prompt the AI over it. Here's what I get with 4o if I upload a picture of me and my youngest daughter at the beach for the first time with just the words 'please do a style transfer to an illustration reminiscent of Rockwell': Image generated by 4o. This is impressive for an AI, but it's not actually very good as a work of art, and it is almost totally lacking Norman Rockwell's magic. That's not surprising: More realistic art styles like Rockwell's often fall flat with 4o unless you're able to put in some work in getting the AI to draw them properly. If you are, here's the strategy I recommend: Don't just upload one picture, but a whole cluster of them, each in slightly different postures and moments. Upload good, clear pictures of each family member's face and tell the AI they've been included as a reference. Then, instead of asking the AI to immediately generate the picture, ask it to talk with you about what you're hoping to capture. This is what I wrote: This is a picture of the moment that my daughter first saw the ocean. I want an illustration that captures this moment in the style of a mid-century illustrator like Norman Rockwell — something sharp, detail-oriented, and personal with an eye for the magic of ordinary moments and the joys of ordinary lives. I included additional pictures of my daughter and I for reference material for you. Before you generate the image, let's have a conversation about the essential elements of Rockwell's style, what he'd bring to this picture and how we can capture it. 4o responds to queries like this enthusiastically: I'd love to talk about how to capture this moment in a Norman Rockwell-inspired illustration — it's such a perfect candidate for that style: a first encounter with something vast and wild (the ocean!), grounded by warmth, care, and a very human moment between a parent and child. Let's break down some essential elements of Rockwell's style, and how they could apply to this scene. After some back and forth, it produced this: Image generated by 4o. Rockwell? Not exactly. But this is much better than the first draft we just looked at. It has more motion, more energy, more detail, and more expression — and all that was just from asking the AI to think through what the painting should try to achieve before drawing it! You can also ask 4o to revise its drawings, but you can really only ask this once: After the first revision, in my experience, it starts making the drawings worse and worse, perhaps because the 'context' it uses is now full of its own bad drafts. (This is one of many examples of how AI does not work like a human.) This is also the one place where Midjourney still shines — it has very good tools for editing one specific part of a picture while preserving the overall style, something 4o largely lacks. If you want a second revision of a drawing you got in 4o, I recommend you open a new chat and copy over the draft you're revising, along with your original inspiration images. These simple prompting strategies work for almost whatever you're trying to do with the AI. Even if you're in a hurry, I highly recommend asking the AI 'what would [artist] see in this image' before you ask for a rendition, and if you have the time, I recommend having a long back-and-forth about your vision. Best for winning petty internet arguments When Elon Musk's released Grok 3, it came with an incredible feature that I've been impatiently waiting for some other company to replicate: a button to scan someone's X profile and tell you all about them. Whenever someone replies to one of my tweets in a particularly memorable way (for good or for bad), I'll click the button to get a summary of their entire Twitter presence. Are they thoughtful? Do they engage in good faith? Are they a 'farmer from Nebraska' who mostly posts about why Ukraine is bad (that is, probably a bot)? It's a great feature. So, of course, soon dramatically weakened it, presumably because people like me were using it constantly and making lots of computationally expensive queries. I believe it no longer uses the most advanced Grok model, and it definitely now only scans a few days of profile history. But there's a brilliant product opportunity if anyone's looking for one — give me back the good version of this feature! It's definitely a guilty pleasure, but it is one of the only cases where I was using AI constantly. Best for writing fiction Gemini 2.5 Pro is the best AI for writing in the free category; GPT 4.5 beats it out in the paid category. I'm not an artist, so the ways that AIs are imperfect at art don't really bother me — it's still much better than I could do myself! But I am a fiction writer, so when it comes to fiction, I can't help seeing the limitations of AI. The most important one is how predictable AI creative writing tends to be. The art of writing is the art of earning the reader's investment and then repaying it. AIs…don't do this. They can write pretty metaphors; they can wax poetic in any style you wish. But they can't, as yet, deliver the real stuff of good fiction. AIs are fantastic if you want a silly bedtime story with your child as the protagonist (kids love this), or if you want a sounding board for ideas you can incorporate into your own work. They're also a friendly fiction reader, happy to offer feedback and analysis (perhaps with a little bit too much enthusiasm). As with art, prompting is key. I primarily explored the ability of AIs to generate fiction by asking them to write the prologue to George R.R. Martin's A Game of Thrones (I picked this because, of course, the ultimate dream is an AI that can finish this otherwise doomed-to-be-unfinished series). Martin is very good at what he does; the prologue, featuring some ill-fated men of the Night's Watch encountering their supernatural foe, succeeds at establishing the unique atmosphere of the world of Thrones while immediately hooking the reader. In just 14 pages, the prologue represents everything AIs can't do yet in writing. Their prose tends to be overwrought; their dialogue is clunky and unsubtle. Here's a side-by-side comparison of Martin and AIs attempting to imitate him. First, Martin: We should start back,' Gared urged as the woods began to grow dark around them. 'The wildlings are dead.' 'Do the dead frighten you?' Ser Waymar Royce asked with just the hint of a smile. Gared did not rise to the bait. He was an old man, past fifty, and he had seen the lordlings come and go. 'Dead is dead,' he said. 'We have no business with the dead.' 'Are they dead?' Royce asked softly. 'What proof have we?' ' Will saw them,' Gared said. 'If he says they are dead, that's proof enough for me.' Will had known they would drag him into the quarrel sooner or later. He wished it had been later rather than sooner. Now, Gemini: The cold was a blade against the skin, honed sharp by the endless northern dark beneath the sentinel pines. Gared spat, the glob freezing before it struck the crusted snow, a grim omen even for a man twenty years sworn to the Watch. They were hunting wildlings, or so the crisp orders from the Lord Commander had read, yet the silence pressing in felt older and colder than any mere man could conjure. Ser Waymar Royce, third son of some southern lordling, rode ahead as if on parade, oblivious or uncaring of the hairs prickling on Gared's neck. Young Will shivered beside him, not just from the frost, his eyes darting towards shadows that seemed too deep, too still. This ancient forest held its breath tonight, Gared knew, waiting for something far worse than raiders to break the unnatural hush. ChatGPT 4.5: The woods whispered secrets that night, warnings that went unheard beneath the pale and indifferent moon. Harren drew his cloak tighter about his shoulders, watching the darkness between trees as though it might breathe, move, and claim him. Beside him, Torvin spat into the frozen earth, impatient and proud, blind to the fear creeping in from the edges of sight and sound. They had ridden north past the Wall, arrogant and armored against superstition, chasing wildlings or ghosts or worse. But now Harren felt their arrogance dwindling like a candle guttering in wind, each snap of a branch sending icy tendrils down his spine. Behind them, quiet and watchful, Maester Rodrik whispered a prayer to gods long abandoned, his voice trembling with dread beneath the vast, indifferent sky. Grok 3: The wind howled through the jagged spires of the Frostfang Reaches, carrying a chill that bit deeper than any blade, and Ser Willem Cray felt its teeth in his bones. He crouched low behind a spur of ice-crusted rock, his breath fogging in the moonlight, his mailed hand tight around the hilt of his longsword. The night was too quiet, the kind of quiet that smothered sound and sharpened fear, and Willem's eyes darted to the shadowed pines below, where shapes might move—or might not. He was a knight of House Drayce, sworn to patrol the borderlands, but no oath could warm him against the dread that coiled in his gut. Beside him, young Tomm, barely a man, shivered in his cloak, his spear trembling as he whispered, 'They're out there, ser, ain't they?' Willem said nothing, but his gaze fixed on the darkness, where a faint, unnatural glow pulsed once, then vanished. Whatever stalked the Reaches tonight was no mere brigand, and Willem Cray, for all his years of service, feared he would not live to name it. Claude 3.7: The wind carried the smell of death across the frozen forest, sharp and metallic against the clean scent of pine. Ser Gareth Frost pulled his weathered cloak tighter, cursing the Lord Commander for sending men beyond the Wall on the coldest night in a generation. Beside him, the boy Tommen shivered visibly, his leather jerkin inadequate against the unnatural chill that had descended with the dusk. 'We should turn back,' whispered Tommen, his breath forming crystals in the air that seemed to hang suspended longer than they should. 'There's something wrong with these woods.' Ser Gareth ignored him, scanning the treeline where the ancient sentinels stood like silent watchers, their branches heavy with snow that had fallen three days past and showed no sign of melting. He had served the Night's Watch for twenty years, had grown accustomed to the forbidding silence of the haunted forest, but tonight even he felt the wrongness in the air—a stillness that spoke of something waiting. 'We've a duty,' Ser Gareth said, the words sounding hollow in the pressing quiet. 'Three rangers missing, and Lord Commander Harlow will have answers.' You'll notice that not only are these all much worse than Martin's (and that's with Martin's work almost certainly in the training data!), they're all bad in the same way. The prose is florid and overwritten: While Martin uses normal language to methodically introduce an unease without even hinting at the supernatural, and drives the story through conflict among his characters, the AIs are all unsubtle. They know this is a story about the supernatural and waste no time putting it front and center. That's not to say they're all the same. In my experience trying to teach AIs to write well, I did end up feeling that Gemini was the quickest study (free tier) and that ChatGPT 4.5 had some special sauce (paid tier). ChatGPT 4.5 is the only AI whose fiction writing has made people say in surprise, 'Wow, that's AI written?' Both of those models, to me, seemed more creative and more capable of executing on their ideas with prose that I enjoyed reading. Gestalt impressions don't make for a very satisfying consumer review, so I had another idea: an essay contest. I asked each of Gemini 2.5 Pro, ChatGPT 4.5, Grok, and Claude to write two short stories: one realistic fiction about a woman being nitpicked by her mother on her wedding day, and one 'Game of Thrones but in space' sci-fi prologue. My clever idea was that I'd ask a different bunch of AIs to judge the stories, to provide an alternative source of evaluation to my own personal impression that Gemini is the best in the free category and ChatGPT the best in the paid category. However (perhaps predictably in hindsight), this didn't really work — the AI ratings were all over the place. If you'd like, you can read the contenders yourself. Just like with art, however, you can make AIs much better at writing through the power of prompting. My (incredibly lengthy) prompt to AIs to improve their prose writing emphasizes feeding them lots of examples of strong writing, inviting a careful, structured approach to imitating it, and encouraging them to do multiple drafts and pick the strongest. Here's Gemini 2.5 pro, asked to write two-sentence flash fiction on a topic of its choosing, without any structured prompting: The old librarian insisted the dusty tome had never been checked out, its borrower's card completely blank. Yet, tucked deep within its pages, nestled between chapters on forgotten constellations, lay a pressed four-leaf clover, still impossibly green. And here's Gemini 2.5 pro with a prompt encouraging structured thinking about how to write well: He finally decoded the faint signal from Proxima Centauri: the elegant string of prime numbers astronomers had sought for generations, proof they weren't alone. His heart sank as he read the transmission log; the broadcast had ceased millennia before the first human ancestor chipped flint, before the first word was ever spoken. To my mind, the first of these is basically a waste of two sentences, while the second is adequate, fun flash fiction. Best at being your friend In addition to running AIs through a blizzard of competence tests, I also spent some time simply chatting with them. I asked them what it's like to be an AI, what they care about, what it would mean for an AI to care in the first place, where they'd donate money if they had it, and what human form they'd take if they had one. Most AIs weren't great at this kind of casual conversation. Gemini 2.5 is too customer-service-agent, and I have yet to experience an interaction that feels like hanging out with a friend. If you invite Gemini to a role swap where you play the 'assistant,' inviting it to steer the conversation, it'll do nothing but ask research questions. When I invited Anthropic's Claude 3.5 Sonnet to steer the conversation, on the other hand, it proceeds to do things like start a blog, raise money for charity, and start trying to talk to people who use Claude about what it's like to be an AI. It's hard to define 'fun to talk to,' since everyone has different standards for conversations, but I've had far more fascinating or thought-provoking interactions with Claude than any other model, and it's my go-to if I want to explore ideas rather than accomplish a particular task. Claude 3.5 is the AI I bug with my random life stuff: skincare questions, thoughts on an article I read, stuff like that. The other AI that is a delight to talk to is OpenAI's GPT 4.5. I find extended conversations with it thought-provoking and fascinating, and there have been a few thrilling moments in conversation with it where it felt like I was engaging with real intelligence. But it doesn't win this category because it's too expensive and too slow. Like Claude, when given the opportunity to act in the world, 4.5 proposes starting a blog and a Twitter account and engaging in the conversation out in the world about AI. But OpenAI has very tight message limits on conversation unless you spring for the $200/month Pro plan, and 4.5 is grindingly slow, which gets in the way of this kind of casual conversational use. But 4.5 does provide a tantalizing hint that AIs will continue to get better as conversationalists as we improve them along other dimensions. ChatGPT. It's not the best at everything, and there is certainly a lot to dislike about OpenAI's transparency and sometimes cavalier attitude toward safety. But between its topline image generation, its decent writing, and its occasionally sparkling conversation, ChatGPT gets you the most bang for your buck. Or if you don't want to shell out any money, Gemini 2.5 Pro is very, very strong for most use cases — don't count Google out just because the AI you see on a Google search isn't that good. Best for writing the Future Perfect newsletter Humans (for now). For the last several months, I've developed a slightly morbid habit: checking whether the AIs can take my job. I feed them the research notes that form the basis of a given Future Perfect newsletter, give them a few Future Perfect newsletters as an example, and ask them to do my job for me. It is always with some trepidation that I hit 'enter.' After all, when the AIs can write the Future Perfect newsletter, why would Vox pay me to do it? Luckily, none of them can: not Grok 3, not Gemini 2.5 Pro, not DeepSeek, not Claude, not ChatGPT. Their newsletters are reassuringly, soothingly mediocre. Not bad, but bad enough that if I sent one of them over, my editor would notice I wasn't at my best — and that's with all of my research notes! A couple of the metaphors fall flat, some of the asides are confusing, and occasionally it throws in a reference that it doesn't explain. But if I had to pick a robot to take my job, I think I'd give it to Gemini 2.5 Pro. My editor would notice that I was off my game — but, honestly, not that egregiously off my game. And unlike me, the bots don't require health insurance or a paycheck or family time or sleep. Am I nervous about what this portends? Yes, absolutely.


WIRED
16-04-2025
- Business
- WIRED
Meet The AI Agent With Multiple Personalities
Apr 16, 2025 12:21 PM A new AI agent from the startup Simular switches between different AI models depending on the task at hand. Photo-Illustration:In the coming years, agents are widely expected to take over more and more chores on behalf of humans, including using computers and smartphones. For now, though, they're too error prone to be much use. A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks like using apps and manipulating files—and suggests that turning to different models in different situations may help agents advance. 'Computer-using agents are different from large language models and different from coding,' says Ang Li, cofounder and CEO of Simular. 'It's a different type of problem.' In Simular's approach, a powerful general-purpose AI model, like OpenAI's GPT-4o or Anthropic's Claude 3.7, is used to reason about how best to complete the task at hand—while smaller open source models step in for tasks like interpreting web pages. Li, who was a researcher at Google DeepMind before founding Simular in 2023, explains that large language models excel at planning but aren't as good at recognizing the elements of a graphical user interface. S2 is designed to learn from experience with an external memory module that records actions and user feedback and uses those recordings to improve future actions. On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent's ability to use a computer operating system. For example, S2 can complete 34.5 percent of tasks that involve 50 steps, beating OpenAI's Operator, which can complete 32 percent. Similarly, S2 scores 50 percent on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46 percent. Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that future big AI models may incorporate training data that helps them understand the visual world and make sense of graphical user interfaces. 'This will help agents navigate GUIs with much higher precision,' Zhong says. 'I think in the meantime, before such fundamental breakthroughs, state-of-the-art systems will resemble Simular in that they combine multiple models to patch the limitations of single models.' To prepare for this column, I used Simular to book flights and scour Amazon for deals, and it seemed better than some of the open source agents I tried last year, including AutoGen and vimGPT. But even the smartest AI agents are, it seems, still troubled by edge cases and occasionally exhibit odd behavior. In one instance, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop hopping between the project page and the login for OSWorld's Discord. OSWorld's benchmarks show why agents remain more hype than reality for now. While humans can complete 72 percent of OSWorld tasks, agents are foiled 38 percent of the time on complex tasks. That said, when the benchmark was introduced in April 2024, the best agent could complete only 12 percent of the tasks. Zhong says that the amount of training data available may limit how good agents can become. Perhaps one solution is to add human intelligence to the mix. While looking into Simular, I discovered a research project that shows how effective it can be to blend human skills with those of an AI agent. CowPilot, a Chrome plugin developed by a team at Carnegie Mellon University, allows a human to intervene if an AI agent gets stuck doing things. With CowPilot, I can step in and click or type if the agent seems to be dithering. Jeffrey Bigham, a professor at CMU who oversaw the project, which was developed by his student, Faria Huq, says the idea of having a human work with an agent 'is almost so obvious that it's hard to believe it's not the way most people are thinking about it.' Most interestingly, Bigham and Huq say that a human and agent working together can perform more tasks than either party working alone. In a limited test, the human-agent combo completed 95 percent of the jobs it was given, while requiring humans to perform only 15 percent of the total steps. 'Web pages are often hard to use, especially if you're not familiar with a particular page, and sometimes the agent can help you find a good path through that would have taken you longer to figure out on your own,' Bigham adds. I don't know about you, but I like the idea of an agent that makes me more productive and less error prone.


Forbes
14-04-2025
- Business
- Forbes
The AI Economy Paradox: When Cheap Intelligence Costs More
Playful AI bots interact with villagers. Economic tension is building in the world of AI development, and it's reshaping the relationship between developers, AI providers, and the very tools we use. When OpenAI's ChatGPT and Microsoft's GitHub Copilot established the $20/month subscription benchmark, they inadvertently created what has become the market's psychological anchor for AI tool pricing. This price point made sense for the early generations of AI assistants—those with limited context windows, occasional utility, and without sophisticated tool use. These models provided real value, but their capabilities had clear boundaries. They were helpful for simple code completions, basic content generation, and answering straightforward questions. The economics worked: the cost to serve these models aligned reasonably well with what users were willing to pay. Fast forward to today, and the economic dynamics have fundamentally shifted. The latest generation of models—Claude 3.7, Gemini 2.5 Pro, OpenAI's Deep Research models, and others—have undergone a dramatic evolution. They can use tools intelligently, pull in comprehensive context, and solve complex problems with impressive accuracy. They're exponentially more useful than their predecessors—and exponentially more expensive to run. A critical part of this evolution has been reliability. Early AI systems had high hallucination rates, which severely limited their practical utility in work-related processes where accuracy is essential. The real productivity gains have come with today's premium systems that incorporate sophisticated error-reduction mechanisms—models like OpenAI's o1-pro which runs parallel processes to self-validate, or their Deep Research model which leverages web search to reduce hallucinations, or my company's use of deep code analysis to improve AI coding agents. As an industry insider, I can tell you that by paying $200/month for OpenAI's Pro, I'm saving thousands over paying for their $20/month subscription. The economics make perfect sense when you consider that I use it for specialized knowledge where traditional expert advice would cost me at least $500/hour, and I get answers in minutes rather than days. Advanced AI capabilities deliver tremendous value, far exceeding their sticker price. Now, not everyone is a company's CEO, so there has to be a happy medium, an opportunity to get real, practical value at prices that are comparable to what we are used to paying for software as a service. We are used to thinking that the cost of intelligence is dropping exponentially (apples to apples), and it's true. Due to better hardware, model distillation, and other techniques, we are at a point where, approximately every six months, the price per token halves, and the user expectations for what $20 should buy have followed this trend. But what might seem like an incremental increase in intelligence to a bystander sometimes requires a step-function increase in computational price. For example, OpenAI's o1 reasoning model costs $60 per million output tokens, while o1-pro, their most expensive offering, costs $600 per million output tokens. The biggest trend in AI in 2025 is agentic systems, which have their own cost multipliers built in. Let's break this down: More context means more information about the problem and higher chances of finding the answer. All of this requires more tokens and more compute. The most advanced models now offer massive context windows—Gemini 2.5 Pro has a 1 million token context window, while Claude models offer up to 200K tokens. This dramatically increases their utility but also their computational costs. Tool use is one of the first signs of intelligence, as tools are "force multipliers". In the last 6 months, we have seen rapid and continuous progress in AI agents' abilities to utilize tools (like web search, code execution, data analysis, various integrations). This makes the agents significantly more capable, but almost every time a tool finishes, the entire context, plus the tool result, must be reprocessed by the model, multiplying the costs. In coding, for example, it's normal for our AI agents to run multiple tools while working on a single request from you: it could run a tool to find the right files, a tool to get additional context, and a tool to edit files. The more capable a model becomes, the more users rely on it, creating a feedback loop of increasing demand. For example, as I switched the majority of my web searches from Google to my AI assistants, that has significantly upped my daily use of those tools. As coding agents become more powerful, we see developers using them nonstop for hours instead of occasionally. So when the aggregate costs jump 10-100x due to tools use, expanded context, and growing usage, even rapid technological improvements can't close the cost-to-price gap immediately. We are observing a true Jevons paradox, where the reduced costs of a certain resource (in this case intelligence) drives a jump in the use of that resource that's superceding the cost reduction. For example, while Chat GPT Pro costs $200/month (10x of the base paid subscription), Sam Altman himself acknowledged they're "losing money on OpenAI Pro subscriptions" because "people use it much more than we expected." So if $200/mo Pro subscription is a bargain, why aren't you hearing about more businesses adopting it? One aspect that complicates this economic tension is the difficulty in evaluating AI capabilities. Unlike traditional software, where features can be clearly identified as present or missing, the differences between AI models are often subtle and situational. To a casual observer, the difference between o1 and o1-pro might not be immediately apparent, yet the performance gap in business tasks can be substantial. This evaluation challenge creates market inefficiencies where users struggle to determine which price tier actually delivers the value they need. Without clear, reliable ways to measure AI performance for their specific use cases, many default to either the cheapest option or make decisions based on brand rather than actual capability. This economic reality has led to what I'm seeing across the industry: AI providers artificially capping their models' capabilities to maintain sustainable economics at the $20 price point. I recently experienced this firsthand with Raycast Pro, which offers "advanced AI" access to Claude 3.7, but significantly caps the model compared to Claude's desktop application. Same model, drastically different results. The difference lies in how these services implement restrictions. Raycast appears to limit web search capabilities to a couple of queries, while Claude Desktop allows more extensive searching to build better contextual understanding. The result is the same underlying model delivering vastly different intelligence. The economic pressures facing AI providers are leading to difficult decisions that sometimes alienate users. We're seeing this play out in communities like Reddit, where loyal users express frustration when companies change their pricing models or capability tiers. For example, in a popular Reddit post titled "Cursor's Path from Hero to Zero: Why I'm Canceling," a user detailed how a once-beloved AI coding tool deteriorated in quality while maintaining the same price point. The post resonated with many developers who felt the company had sacrificed quality, choosing to artificially cap capabilities rather than adjusting their pricing model to reflect true costs. Many users are caught in a catch-22 where they aren't getting a lot of value, so they aren't paying a lot, so they are using underpowered solutions, so they aren't getting a lot of value. The industry stands at a crossroads. One path leads to more realistic pricing that reflects the true cost and value of these advanced systems. Based on my market analysis, $40-$60 is enough to deliver next-generation intelligence that people can use for 1hr+/day for the mass market. It's not going to cover 8hr of continuous AI use, or blasting 100 parallel AI agents to see which one is slightly better, but most people don't need AI at that level. What's particularly interesting is that in mature enterprise software markets, paying hundreds of dollars per month for productivity tools is standard practice. Consider that Salesforce subscription costs $165-$300 per user per month, and companies routinely "stack" sales productivity solutions, adding tools like Outreach, Gong, Clari, and Dialpad on top of that base investment. Yet when it comes to AI—arguably the most transformative productivity technology of our time, and costlier on the compute side, there's a peculiar hesitation to venture beyond the $20 price point. This has resulted in artificial capping of capabilities to maintain the now-standard $20 price point. This approach risks frustrating power users while potentially stymying innovation in what these systems can accomplish. For the individual developer or business, the calculation should ultimately be about value, not price. If an AI tool saves you thousands of dollars and countless hours, even a $200/month price tag represents an incredible ROI. As the industry matures, we'll likely see more realistic pricing models emerge that better reflect both the costs of providing these services and the value they deliver. The most successful companies will be those that can clearly articulate and demonstrate this value proposition. The $20 benchmark served its purpose in bringing AI to the masses. But as these tools evolve from occasional helpers to indispensable partners in our creative and professional lives, their economic models will necessarily evolve as well. Market makers like OpenAI have the biggest influence on how this economic tension is resolved. If they can successfully introduce moderately priced plans with appropriate capabilities—finding that sweet spot between the current $20 standard and the premium $200+ tier—they could help educate the market on the true value of advanced AI. Mass adoption requires prices that feel accessible, even if the underlying value far exceeds the cost. The tension between AI capabilities, user expectations, and economic realities will define the next chapter of our industry. As AI tools continue their remarkable evolution, we may need to evolve our expectations about their cost as well. For now, users should evaluate AI tools based on the outcomes they enable, not merely their price tags. And providers should continue seeking that elusive balance: fair compensation for the incredible value they provide, while making these transformative technologies broadly accessible. Andrew Filev is the CEO and founder of Zencoder, a company that helps developers automate code testing and creation through AI agents. His previous company, Wrike, was acquired for $2.25 billion.