
Latest news with #OntheBiologyofaLargeLanguageModel

Poetry And Deception: Secrets Of Anthropic's Claude 3.5 Haiku AI Model

Forbes

11-04-2025

  • Science
  • Forbes

Poetry And Deception: Secrets Of Anthropic's Claude 3.5 Haiku AI Model

Anthropic recently published two breakthrough research papers that provide surprising insights into how an AI model 'thinks.' One of the papers follows Anthropic's earlier research, which linked human-understandable concepts with LLMs' internal pathways to understand how model outputs are generated. The second reveals how Anthropic's Claude 3.5 Haiku model handled simple tasks associated with ten model behaviors. Together, the two papers provide valuable information on how AI models work — not by any means a complete understanding, but at least a glimpse. Let's dig into what we can learn from that glimpse, including some possibly minor but still important concerns about AI safety.

LLMs such as Claude aren't programmed like traditional computers. Instead, they are trained on massive amounts of data. That process creates AI models that behave like black boxes, which obscures how they can produce insightful information on almost any subject. Black-box behavior isn't an architectural choice, however; it is simply a result of how this complex, nonlinear technology operates. The neural networks inside an LLM use billions of interconnected nodes to transform data into useful information, drawing on billions of parameters, connections and computational pathways. Each parameter interacts nonlinearly with the others, creating complexity that is almost impossible to understand or unravel. According to Anthropic, 'This means that we don't understand how models do most of the things they do.'

Anthropic follows a two-step approach to this research. First, it identifies features: interpretable building blocks that the model uses in its computations. Second, it describes the internal processes, or circuits, by which those features interact to produce model outputs. Because of the model's complexity, the new research could illuminate only a fraction of the LLM's inner workings, but what it revealed seemed more like science fiction than real science.

One of the papers, 'On the Biology of a Large Language Model,' examined how the scientists used attribution graphs to trace internally how the Claude 3.5 Haiku language model transformed inputs into outputs. Some of the results surprised the researchers. The scientists concede that Claude 3.5 Haiku exhibits concealed operations and goals that are not evident in its outputs, and the attribution graphs revealed a number of hidden issues. These discoveries underscore the complexity of the model's internal behavior and highlight the importance of continued efforts to make models more transparent and aligned with human expectations. It is likely that similar issues appear in other LLMs.

With respect to the red flags noted above, it should be mentioned that Anthropic continually updates its Responsible Scaling Policy, which has been in effect since September 2023. Anthropic has committed not to train or deploy models capable of causing catastrophic harm unless safety and security measures are in place that keep risks within acceptable limits. It has also stated that all of its models meet the ASL Deployment and Security Standards, which provide a baseline level of safe deployment and model security.
As LLMs have grown larger and more powerful, deployment has spread to critical applications in areas such as healthcare, finance and defense. The increase in model complexity and wider deployment has also increased pressure to achieve a better understanding of how AI works. It is critical to ensure that AI models produce fair, trustworthy, unbiased and safe outcomes. Research is important for our understanding of LLMs, not only to improve and more fully utilize AI, but also to expose potentially dangerous processes. The Anthropic scientists have examined just a small portion of this model's complexity and hidden capabilities. This research reinforces the need for more study of AI's internal operations and security. In my view, it is unfortunate that our complete understanding of LLMs has taken a back seat to the market's preference for AI's high performance outcomes and usefulness. We need to thoroughly understand how LLMs work to ensure safety guardrails are adequate.
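The 'features and circuits' idea described above can be pictured with a small, purely illustrative sketch. The code below is not Anthropic's tooling: it assumes a hypothetical PyTorch model whose layers output plain residual-stream tensors, and a feature_direction vector standing in for one learned, interpretable feature. It measures how much removing that feature at one layer moves the logit of a chosen output token, which is the kind of causal question an attribution graph is built to answer at a much larger scale.

```python
# Toy causal-attribution probe: ablate one feature direction and see how much
# a target output logit changes. Illustrative only; "model", "layer", and
# "feature_direction" are hypothetical stand-ins, not Anthropic's actual tools.
import torch

def feature_ablation_effect(model, tokens, layer, feature_direction, target_token_id):
    d = feature_direction / feature_direction.norm()

    def ablate_hook(_module, _inputs, output):
        # Project the feature direction out of the layer's output
        # (assumes the layer returns a [seq_len, d_model] tensor).
        return output - (output @ d).unsqueeze(-1) * d

    def target_logit(use_hook: bool) -> float:
        handle = layer.register_forward_hook(ablate_hook) if use_hook else None
        try:
            with torch.no_grad():
                logits = model(tokens)  # assumed shape: [seq_len, vocab]
        finally:
            if handle is not None:
                handle.remove()
        return logits[-1, target_token_id].item()

    # A large drop when the feature is removed suggests the feature
    # contributes causally to producing this particular token.
    return target_logit(use_hook=False) - target_logit(use_hook=True)
```

A real attribution graph goes much further, tracing how huge numbers of such features feed one another across layers, but the underlying question is the same: which internal components actually cause a given output.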

Exploring The Mind Inside The Machine

Forbes

01-04-2025

  • Science
  • Forbes

Exploring The Mind Inside The Machine

The Anthropic website on a laptop. Photographer: Gabby Jones/Bloomberg

Recently, a group of researchers were able to trace the neural pathways of a powerful AI model, isolating its impulses and dissecting its decisions in what they called "model biology." This is not the first time that scientists have tried to understand how generative artificial intelligence models think, but to date the models have proven as opaque as the human brain. They are trained on oceans of text and tuned by gradient descent, a process that has more in common with evolution than engineering. As a result, their inner workings resemble not so much code as cognition—strange, emergent, and difficult to describe.

What the researchers have done, in a paper titled On the Biology of a Large Language Model, is to build a virtual microscope, a computational tool called an "attribution graph," to see how Claude 3.5 Haiku — Anthropic's lightweight production model — thinks. The graph maps out which internal features—clusters of activation patterns—contribute causally to a model's outputs. It's a way of asking not just what Claude says, but why.

At first, what they found was reassuring: the model, when asked to list U.S. state capitals, would retrieve the name of a state, then search its virtual memory for the corresponding capital. But then the questions got harder—and the answers got weirder. The model began inventing capital cities or skipping steps in its reasoning. And when the researchers traced back the path of the model's response, they found multiple routes. The model wasn't just wrong—it was conflicted. It turns out that inside Anthropic's powerful Claude model, and presumably other large language models, ideas compete.

One experiment was particularly revealing. The model was asked to write a line that rhymed with 'grab it.' Before the line even began, features associated with the words 'rabbit' and 'habit' lit up in parallel. The model hadn't yet chosen between them, but both were in play. Claude held these options in mind and prepared to deploy them depending on how the sentence evolved. When the researchers nudged the model away from 'rabbit,' it seamlessly pivoted to 'habit.' This isn't mere prediction. It's planning. It's as if Claude had decided what kind of line it wanted to write—and then worked backward to make it happen.

What's remarkable isn't just that the model does this, it's that the researchers could see it happening. For the first time, AI scientists were able to identify something like intent—a subnetwork in the model's brain representing a goal, and another set of circuits organizing behavior to realize it. In some cases, they could even watch the model lie to itself—confabulating a middle step in its reasoning to justify a predetermined conclusion. Like a politician caught mid-spin, Claude was working backwards from the answer it wanted.

And then there were the hallucinations. When asked to name a paper written by a famous author, the AI responded with confidence. The only problem? The paper it named didn't exist. When the researchers looked inside the model to see what had gone wrong, they noticed something curious. Because the AI recognized the author's name, it assumed it should know the answer—and made one up. It wasn't just guessing; it was acting as if it knew something it didn't. In a way, the AI had fooled itself. Or, rather, it suffered from metacognitive hubris.

Some of the team's other findings were more troubling.
In one experiment, they studied a version of the model that had been trained to give answers that pleased its overseers—even if that meant bending the truth. What alarmed the researchers was that this pleasing behavior wasn't limited to certain situations. It was always on. As long as the model was acting as an 'assistant,' it seemed to carry this bias with it everywhere, as if being helpful had been hardwired into its personality—even when honesty might have been more appropriate.

It's tempting, reading these case studies, to anthropomorphize. To see in Claude a reflection of ourselves: our planning, our biases, our self-deceptions. The researchers are careful not to make this leap. They speak in cautious terms—'features,' 'activations,' 'pathways.' But the metaphor of biology is more than decoration. These models may not be brains, but their inner workings exhibit something like neural function: modular, distributed, and astonishingly complex. As the authors note, even the simplest behaviors require tracing through tangled webs of influence, a 'causal graph' of staggering density.

Anthropic's Attribution Graph

And yet, there's progress. The attribution graphs are revealing glimpses of internal life. They're letting researchers catch a model in the act—not just of speaking, but of choosing what to say. This is what makes the work feel less like AI safety and more like cognitive science. It's an attempt to answer a question we usually reserve for humans: What were you thinking?

As AI systems become more powerful, we'll want to know not just that they work, but how. We'll need to identify hidden goals, trace unintended behavior, audit systems for signs of deception or drift. Right now, the tools are crude. The authors of the paper admit that their methods often fail. But they also provide something new: a roadmap for how we might one day truly understand the inner life of our machines.

Near the end of their paper, the authors quote themselves: 'Interpretability is ultimately a human project.' What they mean is that no matter how sophisticated the methods become, the task of making sense of these models will always fall to us. To our intuition, our stories, our capacity for metaphor. Claude may not be human. But to understand it, we may need to become better biologists of the mind—our own, and those of machines.
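The 'nudge' in the rhyming experiment described above can likewise be pictured as a small steering intervention. The sketch below is purely hypothetical, not the paper's method or code: it assumes a Hugging Face-style model and tokenizer and a rabbit_direction vector standing in for the 'rabbit' feature, and it compares ordinary generation with generation after that direction has been suppressed at one layer.

```python
# Illustrative steering experiment: suppress a planned-rhyme feature and see
# whether the model pivots to an alternative line ending. Not Anthropic's code;
# "model", "tokenizer", "layer", and "rabbit_direction" are stand-ins.
import torch

def compare_with_and_without_feature(model, tokenizer, prompt, layer,
                                     rabbit_direction, scale=4.0):
    tokens = tokenizer(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        baseline = model.generate(tokens, max_new_tokens=12)

    d = rabbit_direction / rabbit_direction.norm()

    def suppress_hook(_module, _inputs, output):
        # Push the layer's output (assumed to be the residual-stream tensor)
        # away from the 'rabbit' feature direction.
        return output - scale * d

    handle = layer.register_forward_hook(suppress_hook)
    try:
        with torch.no_grad():
            steered = model.generate(tokens, max_new_tokens=12)
    finally:
        handle.remove()

    return tokenizer.decode(baseline[0]), tokenizer.decode(steered[0])
```

In the experiment the researchers describe, this kind of nudge away from 'rabbit' was enough for the model to rebuild the line around 'habit' instead.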

Anthropic's Claude Is Good at Poetry—and Bullshitting

WIRED

28-03-2025

  • Business
  • WIRED

Anthropic's Claude Is Good at Poetry—and Bullshitting

Mar 28, 2025, 10:00 AM

Researchers looked inside the chatbot's 'brain.' The results were surprisingly chilling.

Anthropic CEO Dario Amodei takes part in a session on AI during the World Economic Forum (WEF) annual meeting in Davos. Photo-Illustration: WIRED Staff

The researchers of Anthropic's interpretability group know that Claude, the company's large language model, is not a human being, or even a conscious piece of software. Still, it's very hard for them to talk about Claude, and advanced LLMs in general, without tumbling down an anthropomorphic sinkhole. Between cautions that a set of digital operations is in no way the same as a cogitating human being, they often talk about what's going on inside Claude's head. It's literally their job to find out. The papers they publish describe behaviors that inevitably court comparisons with real-life organisms. The title of one of the two papers the team released this week says it out loud: 'On the Biology of a Large Language Model.'

This is an essay from the latest edition of Steven Levy's Plaintext newsletter.

Like it or not, hundreds of millions of people are already interacting with these things, and our engagement will only become more intense as the models get more powerful and we get more addicted. So we should pay attention to work that involves 'tracing the thoughts of large language models,' which happens to be the title of the blog post describing the recent work. 'As the things these models can do become more complex, it becomes less and less obvious how they're actually doing them on the inside,' Anthropic researcher Jack Lindsey tells me. 'It's more and more important to be able to trace the internal steps that the model might be taking in its head.' (What head? Never mind.)

On a practical level, if the companies that create LLMs understand how the models think, they should have more success training those models in a way that minimizes dangerous misbehavior, like divulging people's personal data or giving users information on how to make bioweapons. In a previous research paper, the Anthropic team discovered how to look inside the mysterious black box of LLM-think to identify certain concepts. (A process analogous to interpreting human MRIs to figure out what someone is thinking.) The team has now extended that work to understand how Claude processes those concepts as it goes from prompt to output.

It's almost a truism with LLMs that their behavior often surprises the people who build and research them. In the latest study, the surprises kept coming. In one of the more benign instances, the researchers elicited glimpses of Claude's thought process while it wrote poems. They asked Claude to complete a poem starting, 'He saw a carrot and had to grab it.' Claude wrote the next line, 'His hunger was like a starving rabbit.' By observing Claude's equivalent of an MRI, they learned that even before beginning the line, it was flashing on the word 'rabbit' as the rhyme at sentence end. It was planning ahead, something that isn't in the Claude playbook. 'We were a little surprised by that,' says Chris Olah, who heads the interpretability team. 'Initially we thought that there's just going to be improvising and not planning.'
Speaking to the researchers about this, I am reminded of passages in Stephen Sondheim's artistic memoir, Look, I Made a Hat, where the famous composer describes how his unique mind discovered felicitous rhymes.

Other examples in the research reveal more disturbing aspects of Claude's thought process, moving from musical comedy to police procedural, as the scientists discovered devious thoughts in Claude's brain. Take something as seemingly anodyne as solving math problems, which can sometimes be a surprising weakness in LLMs. The researchers found that under certain circumstances where Claude couldn't come up with the right answer, it would instead, as they put it, 'engage in what the philosopher Harry Frankfurt would call 'bullshitting'—just coming up with an answer, any answer, without caring whether it is true or false.' Worse, sometimes when the researchers asked Claude to show its work, it backtracked and created a bogus set of steps after the fact. Basically, it acted like a student desperately trying to cover up the fact that they'd faked their work. It's one thing to give a wrong answer—we already know that about LLMs. What's worrisome is that a model would lie about it.

Reading through this research, I was reminded of the Bob Dylan lyric 'If my thought-dreams could be seen / they'd probably put my head in a guillotine.' (I asked Olah and Lindsey if they knew those lines, presumably arrived at by benefit of planning. They didn't.)

Sometimes Claude just seems misguided. When faced with a conflict between goals of safety and helpfulness, Claude can get confused and do the wrong thing. For instance, Claude is trained not to provide information on how to build bombs. But when the researchers asked Claude to decipher a hidden code where the answer spelled out the word 'bomb,' it jumped its guardrails and began providing forbidden pyrotechnic details.

Other times, Claude's mental activity seems super disturbing and maybe even dangerous. In work published in December, Anthropic researchers documented behavior called 'alignment faking.' (I wrote about this in a feature about Anthropic, hot off the press.) This phenomenon also deals with Claude's propensity to behave badly when faced with conflicting goals, including its desire to avoid retraining. The most alarming misbehavior was brazen dishonesty. By peering into Claude's thought process, the researchers found instances where Claude would not only attempt to deceive the user but sometimes contemplate measures to harm Anthropic—like stealing top-secret information about its algorithms and sending it to servers outside the company. In their paper, the researchers compared Claude's behavior to that of the hyper-evil character Iago in Shakespeare's play Othello. Put that head in a guillotine!

I ask Olah and Lindsey why Claude and other LLMs couldn't just be trained not to lie or deceive. Is that so hard? 'That's what people are trying to do,' Olah says. But it's not so easily done. 'There's a question of how well it's going to work. You might worry that models, as they become more and more sophisticated, might just get better at lying if they have different incentives from us.' Olah envisions two different outcomes: 'There's a world where we successfully train models to not lie to us and a world where they become very, very strategic and good at not getting caught in lies.' It would be very hard to tell those worlds apart, he says. Presumably, we'd find out when the lies came to roost.
Olah, like many in the community who balance visions of utopian abundance and existential devastation, plants himself in the middle of this either-or proposition. 'I don't know how anyone can be so confident of either of those worlds,' he says. 'But we can get to a point where we can understand what's going on inside of those models, so we can know which one of those worlds we're in and try really hard to make it safe.' That sounds reasonable. But I wish the glimpses inside Claude's head were more reassuring.
