Exploring The Mind Inside The Machine

Forbes 01-04-2025

The Anthropic website on a laptop. Photographer: Gabby Jones/Bloomberg
Recently, a team of researchers traced the neural pathways of a powerful AI model, isolating its impulses and dissecting its decisions in what they called "model biology."
This is not the first time that scientists have tried to understand how generative artificial intelligence models think, but to date the models have proven as opaque as the human brain. They are trained on oceans of text and tuned by gradient descent, a process that has more in common with evolution than engineering. As a result, their inner workings resemble not so much code as cognition—strange, emergent, and difficult to describe.
What the researchers have done, in a paper titled "On the Biology of a Large Language Model," is build a virtual microscope: a computational tool called an "attribution graph" that shows how Claude 3.5 Haiku, Anthropic's lightweight production model, thinks. The graph maps out which internal features (clusters of activation patterns) contribute causally to the model's outputs. It's a way of asking not just what Claude says, but why.
At first, what they found was reassuring: the model, when asked to list U.S. state capitals, would retrieve the name of a state, then search its virtual memory for the corresponding capital. But then the questions got harder—and the answers got weirder. The model began inventing capital cities or skipping steps in its reasoning. And when the researchers traced back the path of the model's response, they found multiple routes. The model wasn't just wrong—it was conflicted.
It turns out that inside Anthropic's powerful Claude model, and presumably other large language models, ideas compete.
One experiment was particularly revealing. The model was asked to write a line that rhymed with 'grab it.' Before the line even began, features associated with the words 'rabbit' and 'habit' lit up in parallel. The model hadn't yet chosen between them, but both were in play. Claude held these options in mind and prepared to deploy them depending on how the sentence evolved. When the researchers nudged the model away from 'rabbit,' it seamlessly pivoted to 'habit.'
This isn't mere prediction. It's planning. It's as if Claude had decided what kind of line it wanted to write—and then worked backward to make it happen.
What's remarkable isn't just that the model does this; it's that the researchers could see it happening. For the first time, AI scientists were able to identify something like intent: a subnetwork in the model's brain representing a goal, and another set of circuits organizing behavior to realize it. In some cases, they could even watch the model lie to itself, confabulating a middle step in its reasoning to justify a predetermined conclusion. Like a politician caught mid-spin, Claude was working backwards from the answer it wanted.
And then there were the hallucinations.
When asked to name a paper written by a famous author, the AI responded with confidence. The only problem? The paper it named didn't exist. When the researchers looked inside the model to see what had gone wrong, they noticed something curious. Because the AI recognized the author's name, it assumed it should know the answer—and made one up. It wasn't just guessing; it was acting as if it knew something it didn't. In a way, the AI had fooled itself. Or, rather, it suffered from metacognitive hubris.
Some of the team's other findings were more troubling. In one experiment, they studied a version of the model that had been trained to give answers that pleased its overseers—even if that meant bending the truth. What alarmed the researchers was that this pleasing behavior wasn't limited to certain situations. It was always on. As long as the model was acting as an 'assistant,' it seemed to carry this bias with it everywhere, as if being helpful had been hardwired into its personality—even when honesty might have been more appropriate.
It's tempting, reading these case studies, to anthropomorphize. To see in Claude a reflection of ourselves: our planning, our biases, our self-deceptions. The researchers are careful not to make this leap. They speak in cautious terms—'features,' 'activations,' 'pathways.' But the metaphor of biology is more than decoration. These models may not be brains, but their inner workings exhibit something like neural function: modular, distributed, and astonishingly complex. As the authors note, even the simplest behaviors require tracing through tangled webs of influence, a 'causal graph' of staggering density.
Anthropic's Attribution Graph
And yet, there's progress. The attribution graphs are revealing glimpses of internal life. They're letting researchers catch a model in the act—not just of speaking, but of choosing what to say. This is what makes the work feel less like AI safety and more like cognitive science. It's an attempt to answer a question we usually reserve for humans: What were you thinking?
As AI systems become more powerful, we'll want to know not just that they work, but how. We'll need to identify hidden goals, trace unintended behavior, audit systems for signs of deception or drift. Right now, the tools are crude. The authors of the paper admit that their methods often fail. But they also provide something new: a roadmap for how we might one day truly understand the inner life of our machines.
Near the end of their paper, the authors write: 'Interpretability is ultimately a human project.' What they mean is that no matter how sophisticated the methods become, the task of making sense of these models will always fall to us: to our intuition, our stories, our capacity for metaphor.
Claude may not be human. But to understand it, we may need to become better biologists of the mind—our own, and those of machines.
