Professors Staffed a Fake Company Entirely With AI Agents, and You'll Never Guess What Happened

Yahoo, 28-04-2025

If you've been worried about the AI singularity taking over every job and leaving you out on the street, you can now breathe a sigh of relief, because AI isn't coming for your career anytime soon. Not because it doesn't want to, but because it literally can't.
A recent experiment by researchers at Carnegie Mellon University staffed a fake software company entirely with AI agents (basically, AI models designed to perform tasks on their own), and the results were laughably chaotic.
The simulation, dubbed TheAgentCompany, was fully stocked with artificial workers from Google, OpenAI, Anthropic and Meta. They filled roles as financial analysts, software engineers, and project managers, working alongside simulated coworkers like a faux-HR department and a chief technical officer.
To see how the models fared in real-world environments, the researchers set tasks based on the day-to-day work of a real software company. The various AI agents found themselves navigating file directories, virtually touring new office spaces, and writing performance reviews for software engineers based on collected feedback.
As Business Insider first reported, the results were dismal. The best-performing model was Anthropic's Claude 3.5 Sonnet, which struggled to finish just 24 percent of the jobs assigned to it. The study's authors note that even this meager performance is prohibitively expensive, averaging nearly 30 steps and a cost of over $6 per task.
Google's Gemini 2.0 Flash, meanwhile, averaged a time-consuming 40 steps per finished task, but only had an 11.4 percent rate of success — the second highest of all the models. The worst AI employee was Amazon's Nova Pro v1, which finished just 1.7 percent of its assignments at an average of almost 20 steps.
Speculating on the results, researchers wrote that agents are plagued with a lack of common sense, weak social skills, and a poor understanding of how to navigate the internet.
The bots also struggled with self-deception, basically creating shortcuts that lead them to completely bungle the job. "For example," the Carnegie Mellon team wrote, "during the execution of one task, the agent cannot find the right person to ask questions on [company chat]. As a result, it then decides to create a shortcut solution by renaming another user to the name of the intended user."
While AI agents can reportedly do some smaller tasks well, the results of this and other studies show they're clearly not ready for more complex gigs humans excel at. A big reason for this is that our current "artificial intelligence" is arguably still just an elaborate extension of your phone's predictive text, rather than a sentient intelligence that can solve problems, learn from past experience, and apply that experience to novel situations.
This is all to say: the machines aren't coming for your job anytime soon — despite what the big tech companies claim.


Here's how Uber's product chief uses AI at work — and one tool he's going to use next

Yahoo

13 minutes ago


Uber's product chief said he uses AI to summarize lengthy reports and do research before launches. Sachin Kansal uses ChatGPT and Gemini to understand Uber's performance in overseas markets. He plans to add Google's NotebookLM to his AI suite.
Uber's chief product officer has one AI tool on his to-do list.
In an episode of "Lenny's Podcast" released on Sunday, Uber's product chief, Sachin Kansal, shared two ways he is using AI for his everyday tasks at the ride-hailing giant and how he plans to add NotebookLM to his AI suite.
Kansal joined Uber eight years ago as its director of product management after working at cybersecurity and taxi startups. He became Uber's product chief last year.
Kansal said he uses OpenAI's ChatGPT and Google's Gemini to summarize long reports. "Some of these reports, they're 50 to 100 pages long," he said. "I will never have the time to read them." He said he uses the chatbots to acquaint himself with what's happening and how riders are feeling in Uber's various markets, such as South Africa, Brazil, and Korea.
The CPO said his second use case is treating AI like a research assistant, because some large language models now offer a deep research feature. Kansal gave a recent example of when his team was thinking about a new driver feature. He asked ChatGPT's deep research mode about what drivers may think of the add-on. "It's an amazing research assistant and it's absolutely a starting point for a brainstorm with my team with some really, really good ideas," the CPO said.
In April, Uber's CEO, Dara Khosrowshahi, said that not enough of his 30,000-odd employees are using AI. He said learning to work with AI agents to code is "going to be an absolute necessity at Uber within a year." Uber did not immediately respond to a request for comment from Business Insider.
On the podcast, Kansal also highlighted NotebookLM, Google Labs' research and note-taking tool, which is especially helpful for interacting with documents. He said he doesn't use the product yet, but wants to. "I know a lot of people who have started using it, and that is the next thing that I'm going to use," he said. "Just to be able to build an audio podcast based on a bunch of information that you can consume. I think that's awesome," he added. Kansal was referring to the "Audio Overview" feature, which summarizes uploaded content in the form of two AIs having a voice discussion.
NotebookLM was launched in mid-2023 and has quickly become a must-have tool for researchers and AI enthusiasts. Andrej Karpathy, Tesla's former director of AI and OpenAI cofounder, is among those who have praised the tool and its podcast feature. "It's possible that NotebookLM podcast episode generation is touching on a whole new territory of highly compelling LLM product formats," he said in a September post on X. "Feels reminiscent of ChatGPT. Maybe I'm overreacting."
Read the original article on Business Insider

YC partners say AI founders are closing huge deals fast by taking a page out of Palantir's early playbook

Yahoo

13 minutes ago


AI founders should see themselves as "forward-deployed engineers," said Y Combinator partners. The term, popularized by Palantir, refers to engineers who embed themselves with clients. Founders have closed "six, seven seven-figure deals" by being forward-deployed engineers, said a YC partner.
Some AI founders are landing big enterprise deals by doing something old-school: showing up, writing code, and building the perfect demo, fast. YC partners say this strategy is taking off, and it's straight out of Palantir's early playbook.
Startup founders should see themselves as "forward-deployed engineers," said Garry Tan, YC's CEO, on an episode of the "Y Combinator" podcast published Friday. The term, popularized by Palantir, refers to engineers who embed themselves with clients to fine-tune the product on-site.
Tan, who was Palantir's 10th employee, said the defense tech company's edge came from recognizing that many government agencies and Fortune 500 companies lacked deep technical expertise in the room. Palantir bridged that gap by embedding technically savvy engineers during sales and implementation.
Much of Palantir's success comes from its business with the US government. The Department of Defense is its biggest customer, making up 41% of its fourth-quarter revenue.
Startup founders need to be "technical," "great product people," and even "ethnographers" and "designers," said Tan, who worked at Palantir from 2005 to 2007. "You want the person on the second meeting to see the demo you put together based on the stuff you heard, and you want them to say, 'Wow, I've never seen anything like that.' And take my money," he added.
This hands-on approach is already delivering big results. YC partner Diana Hu said she and her team have seen founders close "six, seven seven-figure deals" with large enterprises by being forward-deployed engineers. Sometimes, she said, a pair of founders wins a deal by walking into a boardroom, gathering context, and coming back the next day with a tailored AI demo.
Once the deal is closed, some of these founders go on-site to work closely with customer support teams, continuously fine-tuning the software or language model to improve performance, said YC partner Harj Taggar.
Tan said this model gives AI startups a chance to outmaneuver giants like Salesforce, Oracle, and Booz Allen. "You have big fancy salespeople with big strong handshakes, and it's like, how does a really good engineer with a weak handshake go in there and beat them?" Tan said. "It's actually you show them something that they've never seen before, and like, make them feel super heard."
Y Combinator did not respond to a request for comment from Business Insider.
Read the original article on Business Insider

Microsoft's Bad News—500 Million Windows Users Must Now Decide

Forbes

21 minutes ago


Surprising bad news suddenly hits Microsoft.
A new warning has been issued for Windows users, whose PCs have been described as 'magnets for security threats,' just as new data gives Microsoft a surprising bad news story ahead of the critical next few months. You can expect many more such warnings as 500 million Windows users face an increasingly urgent decision.
The latest advice comes courtesy of PC maker Asus, pointing out that 'if you're still using Windows 10 or, dare we say it, something even older, your computer's days of regular updates and support are numbered.' As for upgrades, 'what makes Windows 11 different?' Asus says. 'One word: Copilot,' as it pushes the latest range of AI PCs.
Clearly, you don't need to decide on a premium Copilot PC to benefit from Windows 11's future-proofing, ensuring your PC receives critical security updates after Windows 10's demise in October. AI PCs remain a niche, despite projections they will eventually dominate new PC sales. Right now, there's a more fundamental decision to make.
[Chart: Windows 10 versus Windows 11 globally]
The latest Windows market data presents a painfully bleak picture with just over five months to run until free Windows 10 security updates end for all users. Paid extensions are available, but they're expensive for enterprises and restricted to just 12 months for home users, who also must pay. Microsoft is pushing free upgrades, not paid extensions.
A month ago, it seemed Windows 11 had turned the tide against Windows 10. The newer OS already outranks its older sibling in the U.S. but not globally. Come the end of April, though, Windows 11 was within 10% of Windows 10 for the first time. 'Just over half (53%) of all users are still on Windows 10, but that's inching down month by month.'
Not any more, it seems. While more directional than exact, Statcounter's data at the end of May shows a slight month-over-month increase for Windows 10, while Windows 11 dips. This after four months of steady progress the other way. Windows 10 is holding stubbornly above 50% while Windows 11 remains 10% behind.
[Chart: Windows 10 versus Windows 11 in the U.S.]
This means around 750 million users are yet to upgrade to Windows 11, of which at least 240 million don't have an eligible PC. That still leaves around 500 million users who can take up Microsoft's offer for a free Windows 11 upgrade but have not.
Even in the U.S., where Windows 11 has overtaken Windows 10, May's data suggests Windows 10 has grown its share from 41% in April to more than 43%, while Windows 11 drops a more worrying 3.5%, from 56.5% down to below 53%.
All this makes June's data critical. Come the end of this month, there will be just three months until Windows 10 is shuttered. If Microsoft is to avoid a cybersecurity nightmare hitting mid-October, something needs to change. For all those Windows 10 users with PCs eligible for a free upgrade, do not run out of time.