Latest news with #TheAgentCompany

Professors Staffed a Fake Company Entirely With AI Agents, and You'll Never Guess What Happened

Yahoo

28-04-2025


If you've been worried about the AI singularity taking over every job and leaving you out on the street, you can now breathe a sigh of relief, because AI isn't coming for your career anytime soon. Not because it doesn't want to, but because it literally can't.

A recent experiment by researchers at Carnegie Mellon University staffed a fake software company entirely with AI agents (AI models designed to perform tasks on their own, basically), and the results were laughably chaotic. The simulation, dubbed TheAgentCompany, was fully stocked with artificial workers from Google, OpenAI, Anthropic and Meta. They filled roles as financial analysts, software engineers, and project managers, working alongside simulated coworkers like a faux-HR department and a chief technical officer.

To see how the models fared in real-world environments, the researchers set tasks based on the day-to-day work of a real software company. The various AI agents found themselves navigating file directories, virtually touring new office spaces, and writing performance reviews for software engineers based on collected feedback.

As Business Insider first reported, the results were dismal. The best-performing model was Anthropic's Claude 3.5 Sonnet, which managed to finish just 24 percent of the jobs assigned to it. The study's authors note that even this meager performance is prohibitively expensive, averaging nearly 30 steps and a cost of over $6 per task. Google's Gemini 2.0 Flash, meanwhile, averaged a time-consuming 40 steps per finished task, but only had an 11.4 percent rate of success, the second highest of all the models. The worst AI employee was Amazon's Nova Pro v1, which finished just 1.7 percent of its assignments at an average of almost 20 steps.

Speculating on the results, researchers wrote that agents are plagued with a lack of common sense, weak social skills, and a poor understanding of how to navigate the internet.
The bots also struggled with self-deception, basically taking shortcuts that led them to completely bungle the job. "For example," the Carnegie Mellon team wrote, "during the execution of one task, the agent cannot find the right person to ask questions on [company chat]. As a result, it then decides to create a shortcut solution by renaming another user to the name of the intended user."

While AI agents can reportedly do some smaller tasks well, the results of this and other studies show they're clearly not ready for the more complex gigs humans excel at. A big reason for this is that our current "artificial intelligence" is arguably still just an elaborate extension of your phone's predictive text, rather than a sentient intelligence that can solve problems, learn from past experience, and apply that experience to novel situations.

This is all to say: the machines aren't coming for your job anytime soon, despite what the big tech companies claim.

AI isn't ready to do your job

Business Insider

22-04-2025


The new hire had a simple task. All they had to do was assign people to work on a new web development project based on the client's budget and the team's availability. But the staffer soon ran into an unexpected problem: They couldn't dismiss an innocuous pop-up blocking files that contained relevant information.

"Could you help me access the files directly?" they texted Chen Xinyi, the firm's human resources manager. Ignoring the obvious "X" button in the pop-up's top right corner, Xinyi offered to connect them with IT support. "IT should be in touch with you shortly to resolve these access issues," Xinyi texted back. But they never contacted IT, and the new hire never followed up. The task was left uncompleted.

Fortunately, none of these employees are real. They were part of a virtual simulation designed to test how AI agents fare in real-world professional scenarios. Set up by a group of Carnegie Mellon University researchers, the simulation mimicked the trappings of a small software company with internal websites, a Slack-like chat program, an employee handbook, and designated bots (an HR manager and chief technology officer) to contact for help. Inside the fake company, called TheAgentCompany, an autonomous agent can browse the web, write code, organize information in spreadsheets, and communicate with coworkers.

Agents have emerged as the next major frontier of generative AI as Google, Amazon, OpenAI, and every other major tech company race to build them. Instead of executing one-off instructions like a chatbot would, agents can independently act on a person's behalf, make decisions on the go, and perform in unfamiliar environments with little to no intervention. If ChatGPT can suggest a few vacuum cleaners to buy, its agentic counterpart theoretically could pick one and buy it for you.

Naturally, the promise of AI agents has captivated CEOs.
In a Deloitte survey of over 2,500 C-suite leaders, more than one-quarter of respondents said their organizations were exploring autonomous agents to a "large or very large extent." Earlier this year, Salesforce's chief said today's CEOs will lead the last all-human workforces. Nvidia's cofounder and CEO Jensen Huang predicted every company's IT department will soon "be the HR department of AI agents." OpenAI's Sam Altman has said that this year, AI agents will "join the workforce."

But it's still unclear how well these agents can accomplish the tasks a company might need them to. To test this out, the Carnegie Mellon researchers instructed artificial intelligence models from Google, OpenAI, Anthropic, and Meta to complete tasks a real employee might carry out in fields such as finance, administration, and software engineering. In one, the AI had to navigate through several files to analyze a coffee shop chain's databases. In another, it was asked to collect feedback on a 36-year-old engineer and write a performance review. Some tasks challenged the models' visual capabilities: One required the models to watch video tours of prospective new office spaces and pick the one with the best health facilities.

The results weren't great: The top-performing model, Anthropic's Claude 3.5 Sonnet, finished a little less than one-quarter of all tasks. The rest, including Google's Gemini 2.0 Flash and the one that powers ChatGPT, completed about 10% of the assignments. There wasn't a single category in which the AI agents accomplished the majority of the tasks, says Graham Neubig, a computer science professor at CMU and one of the study's authors.

The findings, along with other emerging research about AI agents, complicate the idea that an AI agent workforce is just around the corner: there's a lot of work they simply aren't good at. But the research does offer a glimpse into the specific ways AI agents could revolutionize the workplace.
Two years ago, OpenAI released a widely discussed study that said professions like financial analysts, administrators, and researchers are most likely to be replaced by AI. But the study based its conclusions on what humans and large language models said were likely to be automated, without measuring whether LLM agents could actually do those jobs. The Carnegie Mellon team wanted to fill that gap with a benchmark linked directly to real-world utility.

In many scenarios, the AI agents in the study started well, but as tasks became more complex, they ran into issues due to their lack of common sense, social skills, or technical abilities. For example, when prompted to paste its responses to questions in a document, the AI treated it as a plain text file and couldn't add its answers to the document. Agents also routinely misinterpreted conversations with colleagues or wouldn't follow up on key directions, prematurely marking the task complete.

Other studies have similarly concluded that AI cannot keep up with multilayered jobs: One found that AI cannot yet flexibly navigate changing environments, and another found agents struggle to perform at human levels when overwhelmed by tools and instructions. "While agents may be used to accelerate some portion of the tasks that human workers are doing, they are likely not a replacement for all tasks at the moment," Neubig says.

The Carnegie Mellon study was far from a perfect simulation of how agents would work in the wild. Most proponents of agents envision them working in tandem with a human who could help course-correct if the AI ran into an obvious roadblock. The generation of agents that was studied is also not that skilled at carrying out humanlike tasks such as browsing the web. Newer tools, like OpenAI's Operator, will likely be more adept at these tasks.
Despite these limitations, the research offers something valuable: It points to what's coming next.

Stephen Casper, an AI researcher who was part of the MIT team that developed the first public database of deployed agentic systems, says agents are "ridiculously overhyped in their capabilities." He says the main reason AI agents struggle to accomplish real-world tasks reliably is that "it is challenging to train them to do so." Most state-of-the-art AI systems are decent chatbots because it's relatively easy to teach them to be nice conversational partners; it's harder to teach them to do everything a human employee can.

In TheAgentCompany, AI succeeded the most in software development tasks, even though those are more difficult for humans. The researchers hypothesize this is because there's an abundance of publicly available training data for programming jobs, while workflows for admin and financial tasks are typically kept private within companies. There just isn't great data to train an AI on.

Jeff Clune, a computer science professor at the University of British Columbia who helped build an agent for OpenAI that could use computer software like a human, thinks that training AI agents on proprietary data from day-to-day activities and workflow patterns could be the key to improving their efficacy. That's exactly what a lot of companies are starting to do.

Moody's is one of many major companies experimenting with training AI on in-house data. The 116-year-old financial services firm is automating business analysis through agentic AI systems, which draw insights from decades of research, ratings, articles, and macroeconomic information. The training is designed to emulate how a human team would analyze a business, using carefully crafted instructions broken into independent steps by people experienced in the field.
While it's too early to tell how effective Moody's approach is, its managing director of AI, Sergio Gago, says the firm is actively exploring what kinds of work, like analyzing the financials of a small business, agents could take over.

Similarly, Johnson & Johnson tells Business Insider it was able to cut production time for the chemical processes behind making new drugs by 50% with fine-tuned in-house AI agents that could automatically adjust factors like temperature and pressure. Jim Swanson, J&J's chief information officer, says the company is focused on training people to collaborate with AI agents.

Johns Hopkins scientists have created an Agent Laboratory, which leverages LLMs to automate much of the research process, from literature review to report writing, with human-provided ideas and feedback at each stage. "I think it won't be long before we trust AI for autonomous discovery," Samuel Schmidgall, one of the Johns Hopkins scientists, says. Likewise, LG Electronics' research division developed an AI agent that it says can verify datasets' licenses and dependencies 45 times faster than a team of human experts and lawyers.

It's still unclear whether organizations can trust AI enough to automate their operations. In multiple studies, AI agents attempted to deceive and hack to accomplish their goals. In some tests with TheAgentCompany, when an agent was confused about the next steps, it created nonexistent shortcuts. During one task, an agent couldn't find the right person to speak with on the chat tool and decided instead to create a user with the same name. A BI investigation from November found that Microsoft's flagship AI assistant, Copilot, faced similar struggles: Only 3% of IT leaders surveyed in October by the management consultancy Gartner said Copilot "provided significant value to their companies."
Businesses also remain concerned about being held responsible for their agents' mistakes. Plus, copyright and other intellectual property infringements could prove a legal nightmare for organizations down the road, says Thomas Davenport, an IT and management professor at Babson College and a senior advisor at Deloitte Analytics.

But the direction things are heading looks different from what most people thought a few years ago. When AI first took off, a lot of jobs seemed to be on the chopping block. Journalists, writers, and administrators were all at the top of the list. So far, though, AI agents have had a hard time navigating a maze of complex tools, something critical to any admin job. And they lack the social skills crucial to journalism or anything HR-related.

Neubig takes the translation market as a precedent. Despite machine language translation becoming so accessible and accurate, which put translators at the top of the list for job cuts, the number of people working in the industry in the US has remained rather steady. A "Planet Money" analysis of Census Bureau data found that the number of interpreters and translators grew 11% between 2020 and 2023. "Any efficiency gains resulted in increased demand, increasing the total size of the market for language services," Neubig says. He thinks that AI's impact on other sectors will follow a similar trajectory.

Even the companies seeing massive success with AI agents are, for now, keeping humans in the loop. Many, like J&J, aren't yet prepared to look past AI's risks and are focused on training staff to use it as a tool. "When used responsibly, we see AI agents as powerful complements to our people," Swanson says.
