Testosterone Therapy Is Booming. But Is It Actually Safe?

As more men turn to testosterone replacement therapy (TRT) for energy, mood and muscle, experts warn the risks are still not fully understood.
By Stephanie Pappas, Fonda Mwangi & Alex Sugiura
This episode was made possible by the support of Yakult and produced independently by Scientific American's board of editors.
Rachel Feltman: For Scientific American's Science Quickly, I'm Rachel Feltman.
Whether it's framed as a cure-all for fatigue and low libido or a shortcut to gaining muscle mass, testosterone replacement therapy, or TRT, is all over the Internet these days. But how much of the hype is actually backed by science?
Here to help us make sense of the testosterone boom is Stephanie Pappas, a freelance reporter based in Colorado. Stephanie recently covered the growing popularity—and availability—of TRT for Scientific American.
Thanks so much for coming on to chat.
Stephanie Pappas: Thank you.
Feltman: So you recently wrote about testosterone replacement therapy for Scientific American. For folks who are not on the right part of the Internet to have heard all about this—or maybe staying off the wrong parts [laughs] of the Internet, depending on your perspective—what's going on with TRT right now?
Pappas: Well, testosterone replacement therapy has become extremely popular. It has been something that's been in the background for many, many years. Synthetic testosterone was first synthesized in 1935, but for a long, long time people thought that testosterone replacement, if it was used for any kind of symptoms men might be having, could cause prostate cancer. And then it was believed, perhaps, it could cause heart disease or cardiovascular events like a stroke or a heart attack.
As it turns out, in the last few years we've found that it doesn't really cause these serious events. However, a lot less is known about the long-term health impacts. People are really flocking to TRT largely as a result of word of mouth. There are a lot of private clinics that offer this out of pocket, so you don't have to have an insurance company agree that you need it. And people on social media are using it for just a litany of different symptoms, and it can be anything from muscle-building to fatigue to mood problems and irritability, and it's kind of being pitched as a cure-all for a lot of different things.
Feltman: And what evidence is there for the benefits of testosterone replacement therapy, maybe starting with people who actually have low testosterone?
Pappas: Yeah, so there is such a thing as low testosterone. No one exactly agrees on what the cutoff is, and probably that's because there's a lot of variability in our hormones—like, anyone who's ever tried any sort of hormone treatment, including birth control or HRT [hormone replacement therapy], can tell you that people respond really differently.
So for men who really do have low testosterone, the evidence suggests that you can see some benefits in mood if you have major depression. You may see some improvements in energy. The most well-established result from the studies of TRT is that you'll probably see a little boost in libido if you have low testosterone and you now start taking TRT, and that's because testosterone works in the brain to increase sexual desire.
Feltman: Hmm.
Pappas: For men who don't have low testosterone, and that's many of the men who are now getting treatment, the evidence for benefits is much, much lower. We don't know if you really see much besides additional muscle-building abilities.
Feltman: And what are the potential downsides? You mentioned that one of the reasons there's such a boom right now is that research has showed that the connection to prostate cancer is not concerning the way we once thought it was. But what about other issues that can come up when you don't have low testosterone and you start taking a bunch of testosterone?
Pappas: Right, so if you are taking a testosterone supplement, your body actually shuts down its own testosterone production. There's this neat little feedback loop that says, 'Oh, if the testosterone's high in the blood, we're going to just kind of ramp it down.' And a side effect of that is, actually, because testosterone is involved in sperm production, your body will also stop producing sperm. So as more younger men turn to TRT, we are seeing that men who are interested in still having children are finding they're losing their fertility. Oftentimes men are told, 'Oh, you'll recover it once you stop.' But that can actually be slow and complicated, so urologists in the field often see men who aren't understanding why they're not, you know, able to get their partner pregnant, and they may have tried for quite some time.
Feltman: Right, and, you know, not that this is the reason that's upsetting, but there is also kind of an irony there because a lot of the marketing is sort of stereotypical masculinity, so it's not surprising that people are caught off guard by that potential downside.
Pappas: Yes, absolutely. They are really marketing this—if you go, you can see it on billboards or online—these ads are all about muscles, they're about machismo. And oftentimes the reports from some of these freestanding clinics are that men are not being told all the information about all the side-effect possibilities.
Feltman: When you say that regaining fertility after these treatments can be complex and slow, could you walk us through what you mean by that?
Pappas: Sure, because your own testosterone levels and sperm production drop, you're going to have to, usually, get off the testosterone. That can really lead to a hormone crash; since your body is, at that point, quite low in testosterone, you may feel irritable, you may feel fatigued. So you're gonna have to go through that—a bit of a roller coaster.
Doctors will prescribe some medications that can help even out your levels and help encourage your body to start producing its own sperm again. That can take some time; it can be a little expensive. Urologists can help you, though. But they do say that they are concerned that men have a, often, too rosy picture of what that's gonna look like. It can take up to two years to recover full fertility, there's kind of an unknown as to whether sperm quality will be quite as high as it was beforehand. And as anyone who's trying to have kids knows, two years can be quite a while when you're dealing with fertility problems.
Feltman: Yeah, so let's talk some more about those freestanding clinics. You know, in addition to TRT, you know, being more in demand and more in the conversation, it also seems like it's more accessible than ever, so what are some of the sort of concerning characteristics of these clinics that are popping up?
Pappas: Well, you don't wanna paint all clinics with the same brush ...
Feltman: Sure.
Pappas: Because there is a wide variety of care out there. So it can be any provider that can prescribe—because testosterone is a controlled substance—but they may not really be running you through a full workup, as a urologist or an endocrinologist affiliated with a practice or a hospital system might do.
The recommendations from professional societies suggest you get two testosterone tests on different days because testosterone levels swing wildly. I could not find anyone who'd reported to me that they'd gotten two tests. I can't say that there aren't clinics that do it. Typically you're gonna get one test. Typically they are motivated to prescribe what they can to you.
The problem, often, is that because of this long-term fear around testosterone, many primary care doctors are nervous about prescribing it or don't feel that they've been trained. I spoke to one man whose doctor actually said, 'Yes, your testosterone is undeniably low, but I don't know what to do about it. Maybe just go to one of these clinics, and they can help you.'
His experience in that clinic, unfortunately, was that they kind of gave him a generic prescription, did not really test his levels, didn't really talk through, you know, alternative treatments or other things he might look at doing. So he felt at a loss, and he ended up looking on Reddit for advice, which, as we all know [laughs], is a real hit-and-miss proposition ...
Feltman: Sure.
Pappas: So men are often kind of left searching for their own information, and they may not have good sources of information.
Feltman: And the experts that you spoke to, what do they wanna see change about the way we're treating TRT?
Pappas: The first step is that a lot of physicians who specialize in hormone replacement therapy for men would like to see more awareness among primary care physicians and other doctors that men might go to, because if they could coordinate that care in a really responsible way, there are probably many men who could benefit: they do have low testosterone but haven't ever thought about being tested.
And then the other side of this is just patient education. If you're going to consider going to a clinic, don't just go somewhere that will happily hand you a prescription. Really look for someone who is going to sit down with you, who is going to talk through lifestyle changes, who's going to look at alternative problems. So one doctor I spoke to said, 'The first thing we do is we look for sleep apnea in our patients. If we can cure that, oftentimes we don't need to look at their testosterone levels again.'
And don't be in a rush to walk out that first day with a prescription that might be too high for you and might lead to side effects like acne. Another side effect you can see is an overgrowth of red blood cells, which can mean you need to donate blood every month to keep that in the normal range. Look for something that's not going to cause the side effects that can really affect your life in the long term.
Feltman: Sure, well, thank you so much for coming on to talk us through your feature. I really appreciate it.
Pappas: Thank you so much.
Feltman: That's all for today's episode. You can read Stephanie's full story on TRT in the July/August issue of Scientific American.
We'll be back next week with something special: a three-part miniseries on bird flu. From avian influenza's wild origins to its spread across U.S. farms to the labs trying to keep it from becoming the next pandemic, this looming public health threat has a lot of moving parts, but we'll get you all caught up.
Science Quickly is produced by me, Rachel Feltman, along with Fonda Mwangi, Kelso Harper, Naeem Amarsy and Jeff DelViscio. This episode was edited by Alex Sugiura. Shayna Posses and Aaron Shattuck fact-check our show. Our theme music was composed by Dominic Smith. Subscribe to Scientific American for more up-to-date and in-depth science news.

Related Articles

4 Fitness Tests Trainers Swear By

New York Times

Whether you are new to fitness or looking to break an exercise plateau, you need to test yourself occasionally. Evaluating your baseline fitness might tell you that your right leg is stronger than your left, for example, or that your core is weaker than you thought, allowing you to tailor your workouts and improve. Hiring a trainer to do a full assessment of your fitness is the best way to create a fully personalized plan. But there are some effective tests you can do on your own to get an idea of your strengths and weaknesses.

There are hundreds of fitness tests available on the internet. To find the ones that are worthwhile, we asked trainers and exercise experts to pick four tried-and-tested self-assessments for strength, balance and cardiovascular fitness. 'No test is the best test, but these oldies are goodies,' said Mark Murphy, a sports physical therapist at Mass General Brigham's Center for Sports Performance Research in Foxborough, Mass.

These tests aren't meant to be pass-fail, Mr. Murphy added. You can compare your scores against data for your gender and age, but it's more important to compete with yourself, he said. 'Think of your scores as benchmarks for building better fitness.'

Rogue Worlds May Not Be So Lonely After All, Europa Clipper Completes Key Test, and RFK, Jr., Pulls $500 Million in mRNA Vaccine Funding

Scientific American

Rachel Feltman: Happy Monday, listeners! For Scientific American's Science Quickly, I'm Rachel Feltman. Let's kick off the week with our usual science news roundup.

Let's start with some space news. Have you ever heard of rogue planets? They sound pretty cool, and they are: the term refers to exoplanets that roam free instead of orbiting a star. Some of them may be objects that formed like stars, coalescing in the wake of a giant gas cloud's collapse but never gaining enough mass to actually start the process of nuclear fusion. Others may get their start in the usual planetary way—forming from the gas and dust around a star—before getting ejected out into open space for some reason or another. According to a preprint study made available last month, the life of a rogue planet might not always be as lonely as it sounds. Some of them may be able to form little planetary systems of their own.

The researchers behind the new study, which still has to go through peer review, used instruments on the James Webb Space Telescope to gather information about eight different rogue planets, each with a mass around five to 10 times greater than Jupiter's. Based on infrared observations, the scientists say, six of the objects seem to have warm dust around them, indicating the presence of the kinds of disks where planets form. The researchers also saw silicate grains in the disks—evidence that the dust is growing and crystallizing. That's typically a disk's signature move when it's gearing up to make some baby planets. This study didn't actually find any hints of fully grown planets orbiting those giant rogue worlds, but it suggests that such a phenomenon might be possible. As wild as it is to imagine a lonely world roaming space without a star to orbit, it's even more intriguing to picture a whole system of planets spinning in the dark.

Speaking of space, NASA's Europa Clipper, which is expected to arrive at the Jupiter system in 2030 so it can study the gas giant's icy moon, has completed an important test. Back in March 2025 the Europa Clipper flew past Mars and conducted a test of its REASON instrument. That's short for Radar for Europa Assessment and Sounding: Ocean to Near-surface. This radar is a crucial component of the clipper's mission because it's designed to peek beneath the icy shell of Europa's surface, perhaps even glimpsing the ocean beneath it. The radar will also help NASA scientists study the ice itself, along with the topography of Europa's surface. The clipper features a huge pair of solar arrays that carry the slender antennas REASON needs to do its work. The antennas span a distance of about 58 feet, while the arrays collectively stretch the length of a basketball court, which is necessary for them to gather enough light—Europa gets just around 1/25th as much sunlight as we do on Earth. The sheer size of all those components made it impossible to fully test REASON on Earth because once the flight hardware was finished, the clipper had to be kept inside a clean room. NASA simply didn't have a sterile chamber big enough to properly assess the radar. When Europa Clipper flew by Mars on March 1, REASON sent and received radio waves for about 40 minutes, collecting 60 gigabytes of data. Earlier this month NASA announced that scientists had completed their analysis of the data and deemed the REASON instrument ready for prime time.

Let's move on to some public health news—first, vaccines. Last Tuesday, the Guardian reported that COVID cases in the U.S. are on the rise, as has been the case each summer since the start of the pandemic. Though this current surge has seen case numbers growing more slowly than in previous years, experts who spoke to the Guardian voiced concerns about what the coming months could bring. In May, U.S. Food and Drug Administration officials wrote that, come fall, COVID boosters may be limited to older people and individuals at higher risk of getting severely ill. Even if this move doesn't outright prevent people from vaccinating themselves and their kids, public health experts are concerned that confusion around availability and insurance coverage could lead to a worrisome dip in booster administration.

Meanwhile, U.S. Department of Health and Human Services head Robert F. Kennedy Jr. announced last Tuesday that his department is canceling almost $500 million in funding for the development of mRNA vaccines. While experts say mRNA vaccines are safe, have the potential to curb future pandemics, and have already saved millions of lives, Kennedy has come out against the technology. Mike Osterholm, a University of Minnesota expert on infectious diseases and pandemic preparedness, told the Associated Press that he didn't think he'd witnessed 'a more dangerous decision in public health' in his 50 years in the field. We're hoping to focus on explaining mRNA technology in an upcoming episode, so let us know if you have any questions we can answer. You can send those to ScienceQuickly@

In other public health news, a group of scientists say bird flu could be airborne on some dairy farms. In a preprint paper recently posted online, researchers report finding H5N1 influenza virus in both large and small aerosol particles in air sampled from California farms. The scientists also found viral particles in milk, on milking equipment and in wastewater. While H5N1 isn't currently thought to pose a major health risk to humans, its continued circulation in mammals leaves us open to potentially dangerous mutations of the virus.

We'll end this week's roundup with a fun little story about how terrifying humans are. Earlier this month the Wall Street Journal reported that U.S. Department of Agriculture workers are blasting human music and voices from speaker-toting drones to scare wolves away from livestock. Apparently the audio selections for these so-called wolf-hazing attempts include the sounds of fireworks, AC/DC's song 'Thunderstruck' and, perhaps most delightfully, that scene from the movie Marriage Story where Scarlett Johansson and Adam Driver scream at each other. Apparently Driver and ScarJo are doing the trick: the Wall Street Journal reported that noisemaking drones were deployed in southern Oregon after wolves killed 11 cows in the area over the span of 20 days. Once the drones were in hazing mode, there were reportedly just two fatal wolf attacks on cattle in an 85-day period. There's no word yet on how the wolves feel about Laura Dern.

That's all for this week's science news roundup. We'll be back on Wednesday to talk about the latest advances in male contraception. Science Quickly is produced by me, Rachel Feltman, along with Fonda Mwangi, Kelso Harper and Jeff DelViscio. This episode was edited by Alex Sugiura. Shayna Posses and Aaron Shattuck fact-check our show. Our theme music was composed by Dominic Smith. Subscribe to Scientific American for more up-to-date and in-depth science news. For Scientific American, this is Rachel Feltman. Have a great week!

Will Reinforcement Learning Take Us To AGI?

Forbes

There aren't many truly new ideas in artificial intelligence. More often, breakthroughs in AI happen when concepts that have existed for years suddenly take on new power because underlying technology inputs—in particular, raw computing power—finally catch up to unlock those concepts' full potential.

Famously, Geoff Hinton and a small group of collaborators devoted themselves tirelessly to neural networks starting in the early 1970s. For decades, the technology didn't really work and the outside world paid little attention. It was not until the early 2010s—thanks to the arrival of sufficiently powerful Nvidia GPUs and internet-scale training data—that the potential of neural networks was finally unleashed for all to see. In 2024, more than half a century after he began working on neural networks, Hinton was awarded the Nobel Prize for pioneering the field of modern AI.

Reinforcement learning has followed a similar arc. Richard Sutton and Andrew Barto, the fathers of modern reinforcement learning, laid down the foundations of the field starting in the 1970s. Even before Sutton and Barto began their work, the basic principles underlying reinforcement learning—in short, learning by trial and error based on positive and negative feedback—had been developed by behavioral psychologists and animal researchers going back to the early twentieth century.

Yet in just the past year, advances in reinforcement learning (RL) have taken on newfound importance and urgency in the world of AI. It has become increasingly clear that the next leap in AI capabilities will be driven by RL. If artificial general intelligence (AGI) is in fact around the corner, reinforcement learning will play a central role in getting us there. Just a few years ago, when ChatGPT's launch ushered in the era of generative AI, almost no one would have predicted this. Deep questions remain unanswered about reinforcement learning's capabilities and its limits. No field in AI is moving more quickly today than RL. It has never been more important to understand this technology, its history and its future.

Reinforcement Learning 101

The basic principles of reinforcement learning have remained consistent since Sutton and Barto established the field in the 1970s. The essence of RL is to learn by interacting with the world and seeing what happens. It is a universal and foundational form of learning; every human and animal does it.

In the context of artificial intelligence, a reinforcement learning system consists of an agent interacting with an environment. RL agents are not given direct instructions or answers by humans; instead, they learn through trial and error. When an agent takes an action in an environment, it receives a reward signal from the environment, indicating that the action produced either a positive or a negative outcome. The agent's goal is to adjust its behavior to maximize positive rewards and minimize negative rewards over time.

How does the agent decide which actions to take? Every agent acts according to a policy, which can be understood as the formula or calculus that determines the agent's action based on the particular state of the environment. A policy can be a simple set of rules, or even pure randomness, or it can be represented by a far more complex system, like a deep neural network.

One final concept that is important to understand in RL, closely related to the reward signal, is the value function.
The value function is the agent's estimate of how favorable a given state of the environment will be (that is, how many positive and negative rewards it will lead to) over the long run. Whereas reward signals are immediate pieces of feedback that come from the environment based on current conditions, the value function is the agent's own learned estimate of how things will play out in the long term. The entire purpose of value functions is to estimate reward signals, but unlike reward signals, value functions enable agents to reason and plan over longer time horizons. For instance, value functions can incentivize actions even when they lead to negative near-term rewards because the long-term benefit is estimated to be worth it. When RL agents learn, they do so in one of three ways: by updating their policy, updating their value function, or updating both together.

A brief example will help make these concepts concrete. Imagine applying reinforcement learning to the game of chess. In this case, the agent is an AI chess player. The environment is the chess board, with any given configuration of chess pieces representing a state of that environment. The agent's policy is the function (whether a simple set of rules, or a decision tree, or a neural network, or something else) that determines which move to make based on the current board state. The reward signal is simple: positive when the agent wins a game, negative when it loses a game. The agent's value function is its learned estimate of how favorable or unfavorable any given board position is—that is, how likely the position is to lead to a win or a loss. As the agent plays more games, strategies that lead to wins will be positively reinforced and strategies that lead to losses will be negatively reinforced via updates to the agent's policy and value function. Gradually, the AI system will become a stronger chess player.

In the twenty-first century, one organization has championed and advanced the field of reinforcement learning more than any other: DeepMind. Founded in 2010 as a startup devoted to solving artificial intelligence and then acquired by Google in 2014 for ~$600 million, London-based DeepMind made a big early bet on reinforcement learning as the most promising path forward in AI. And the bet paid off. The second half of the 2010s were triumphant years for the field of reinforcement learning. In 2016, DeepMind's AlphaGo became the first AI system to defeat a human world champion at the ancient Chinese game of Go, a feat that many AI experts had believed was impossible. In 2017, DeepMind debuted AlphaZero, which taught itself Go, chess and Japanese chess entirely via self-play and bested every other AI and human competitor in those games. And in 2019, DeepMind unveiled AlphaStar, which mastered the video game StarCraft—an even more complex environment than Go given the vast action space, imperfect information, numerous agents and real-time gameplay. AlphaGo, AlphaZero, AlphaStar—reinforcement learning powered each of these landmark achievements.

As the 2010s drew to a close, RL seemed poised to dominate the coming generation of artificial intelligence breakthroughs, with DeepMind leading the way. But that's not what happened. Right around that time, a new AI paradigm unexpectedly burst into the spotlight: self-supervised learning for autoregressive language models. In 2019, a small nonprofit research lab named OpenAI released a model named GPT-2 that demonstrated surprisingly powerful general-purpose language capabilities.
The following summer, OpenAI debuted GPT-3, whose astonishing abilities represented a massive leap in performance from GPT-2 and took the AI world by storm. In 2022 came ChatGPT. In short order, every AI organization in the world reoriented its research focus to prioritize large language models and generative AI. These large language models (LLMs) were based on the transformer architecture and made possible by a strategy of aggressive scaling. They were trained on unlabeled datasets that were bigger than any previous AI training data corpus—essentially the entire internet—and were scaled up to unprecedented model sizes. (GPT-2 was considered mind-bogglingly large at 1.5 billion parameters; one year later, GPT-3 debuted at 175 billion parameters.)

Reinforcement learning fell out of fashion for half a decade. A widely repeated narrative during the early 2020s was that DeepMind had seriously misread technology trends by committing itself to reinforcement learning and missing the boat on generative AI. Yet today, reinforcement learning has reemerged as the hottest field within AI. What happened?

In short, AI researchers discovered that applying reinforcement learning to generative AI models was a killer combination. Starting with a base LLM and then applying reinforcement learning on top of it meant that, for the first time, RL could natively operate with the gift of language and broad knowledge about the world. Pretrained foundation models represented a powerful base on which RL could work its magic. The results have been dazzling—and we are just getting started.

RL Meets LLMs

What does it mean, exactly, to combine reinforcement learning with large language models? A key insight to start with is that the core concepts of RL can be mapped directly and elegantly to the world of LLMs. In this mapping, the LLM itself is the agent. The environment is the full digital context in which the LLM is operating, including the prompts it is presented with, its context window, and any tools and external information it has access to. The model's weights represent the policy: they determine how the agent acts when presented with any particular state of the environment. Acting, in this context, means generating tokens.

What about the reward signal and the value function? Defining a reward signal for LLMs is where things get interesting and complicated. It is this topic, more than any other, that will determine how far RL can take us on the path to superintelligence.

The first major application of RL to LLMs was reinforcement learning from human feedback, or RLHF. The frontiers of AI research have since advanced to more cutting-edge methods of combining RL and LLMs, but RLHF represents an important step on the journey, and it provides a concrete illustration of the concept of reward signals for LLMs. RLHF was invented by DeepMind and OpenAI researchers back in 2017. (As a side note, given today's competitive and closed research environment, it is remarkable to remember that OpenAI and DeepMind used to conduct and publish foundational research together.)

RLHF's true coming-out party, though, was ChatGPT. When ChatGPT debuted in November 2022, the underlying AI model on which it was based was not new; it had already been publicly available for many months. The reason that ChatGPT became an overnight success was that it was approachable, easy to talk to, helpful, good at following directions. The technology that made this possible was RLHF.
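Before digging into how RLHF works, it may help to restate the RL-to-LLM mapping described above as code. The sketch below is minimal and illustrative; the class and function names are hypothetical placeholders rather than any lab's actual API. The idea is simply that the LLM is the agent, its weights play the role of the policy, the prompt plus its context is the environment state, and generating tokens is the action.

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    """The 'environment' in the LLM setting: prompt, context window, tool outputs."""
    prompt: str
    context: list[str] = field(default_factory=list)

class LLMAgent:
    """The agent; its weights play the role of the RL policy."""
    def __init__(self, weights: dict):
        self.weights = weights  # policy parameters

    def act(self, state: EnvState) -> str:
        # "Acting" means generating tokens conditioned on the state.
        # Placeholder: a real agent would sample from the language model here.
        return f"response to: {state.prompt}"

def rl_step(agent: LLMAgent, state: EnvState, reward_fn) -> float:
    """One conceptual RL step: act, receive a scalar reward, then (elsewhere) update the policy."""
    response = agent.act(state)
    reward = reward_fn(state, response)  # where this number comes from is the key design choice
    # A real trainer would now nudge agent.weights so that high-reward responses
    # become more likely (e.g., with a policy-gradient method such as PPO).
    return reward

# Toy usage with a stand-in reward function that simply prefers longer answers.
agent = LLMAgent(weights={})
print(rl_step(agent, EnvState(prompt="Summarize the study."), lambda s, r: float(len(r))))
```

Everything interesting happens in `reward_fn`: where that number comes from is exactly what distinguishes RLHF from the verifiable-reward methods discussed below.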
In a nutshell, RLHF is a method to adapt LLMs' style and tone to be consistent with human-expressed preferences, whatever those preferences may be. RLHF is most often used to make LLMs 'helpful, harmless and honest,' but it can equally be used to make them more flirtatious, or rude, or sarcastic, or progressive, or conservative.

How does RLHF work? The key ingredient in RLHF is 'preference data' generated by human subjects. Specifically, humans are asked to consider two responses from the model for a given prompt and to select which one of the two responses they prefer. This pairwise preference data is used to train a separate model, known as the reward model, which learns to produce a numerical rating of how desirable or undesirable any given output from the main model is.

This is where RL comes in. Now that we have a reward signal, an RL algorithm can be used to fine-tune the main model—in other words, the RL agent—so that it generates responses that maximize the reward model's scores. In this way, the main model comes to incorporate the style and values reflected in the human-generated preference data.

Circling back to reward signals and LLMs: in the case of RLHF, as we have seen, the reward signal comes directly from humans and human-generated preference data, which is then distilled into a reward model. What if we want to use RL to give LLMs powerful new capabilities beyond simply adhering to human preferences?

The Next Frontier

The most important development in AI over the past year has been language models' improved ability to engage in reasoning. What exactly does it mean for an AI model to 'reason'? Unlike first-generation LLMs, which respond to prompts using next-token prediction with no planning or reflection, reasoning models spend time thinking before producing a response. These models think by generating 'chains of thought,' enabling them to systematically break down a given task into smaller steps and then work through each step in order to arrive at a well-thought-through answer. They also know how and when to use external tools—like a calculator, a code interpreter or the internet—to help solve problems.

The world's first reasoning model, OpenAI's o1, debuted less than a year ago. A few months later, China-based DeepSeek captured world headlines when it released its own reasoning model, R1, that was near parity with o1, fully open and trained using far less compute.

The secret sauce that gives AI models the ability to reason is reinforcement learning—specifically, an approach to RL known as reinforcement learning from verifiable rewards (RLVR). Like RLHF, RLVR entails taking a base model and fine-tuning it using RL. But the source of the reward signal, and therefore the types of new capabilities that the AI gains, are quite different. As its name suggests, RLVR improves AI models by training them on problems whose answers can be objectively verified—most commonly, math or coding tasks.

First, a model is presented with such a task—say, a challenging math problem—and prompted to generate a chain of thought in order to solve the problem. The final answer that the model produces is then formally determined to be either correct or incorrect. (If it's a math question, the final answer can be run through a calculator or a more complex symbolic math engine; if it's a coding task, the model's code can be executed in a sandboxed environment.)
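To make 'verifiable' concrete, a reward function for these tasks can be as simple as an exact check of the final answer, or running generated code against test cases. The sketch below is a minimal, hypothetical illustration of such verifiers, not any lab's actual training code; the `solve` entry-point name is an assumption.

```python
def math_reward(model_answer: str, reference_answer: str) -> float:
    """Verifiable reward for a math task: +1 if the final answer matches, -1 otherwise."""
    try:
        correct = abs(float(model_answer) - float(reference_answer)) < 1e-9
    except ValueError:
        correct = False  # an unparseable answer counts as wrong
    return 1.0 if correct else -1.0

def code_reward(model_program: str, test_cases: list[tuple[tuple, object]]) -> float:
    """Verifiable reward for a coding task: run the generated function against tests.

    Note: calling exec() on model output is for illustration only; as the article
    says, real systems execute generated code in a sandboxed environment.
    """
    namespace: dict = {}
    try:
        exec(model_program, namespace)  # expected to define a function named `solve`
        solve = namespace["solve"]
        passed = all(solve(*args) == expected for args, expected in test_cases)
    except Exception:
        passed = False
    return 1.0 if passed else -1.0

# Toy usage: the generated program earns +1 only if it passes every test case.
program = "def solve(x, y):\n    return x + y\n"
print(math_reward("42", "42.0"))                          # -> 1.0
print(code_reward(program, [((1, 2), 3), ((5, 7), 12)]))  # -> 1.0
```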
Because we now have a reward signal—positive if the final answer is correct, negative if it is incorrect—RL can be used to positively reinforce the types of chains of thought that lead to correct answers and to discourage those that lead to incorrect answers. The end result is a model that is far more effective at reasoning: that is, at accurately working through complex multi-step problems and landing on the correct solution. This new generation of reasoning models has demonstrated astonishing capabilities in math competitions like the International Math Olympiad and on logical tests like the ARC-AGI benchmark.

So—is AGI right around the corner? Not necessarily. A few big-picture questions about reinforcement learning and language models remain unanswered and loom large. These questions inspire lively debate and widely varying opinions in the world of artificial intelligence today. Their answers will determine how powerful AI gets in the coming months.

A Few Big Unanswered Questions

Today's cutting-edge RL methods rely on problems whose answers can be objectively verified as either right or wrong. Unsurprisingly, then, RL has proven exceptional at producing AI systems that are world-class at math, coding, logic puzzles and standardized tests. But what about the many problems in the world that don't have easily verifiable answers?

In a provocative essay titled 'The Problem With Reasoners,' Aidan McLaughlin elegantly articulates this point: 'Remember that reasoning models use RL, RL works best in domains with clear/frequent reward, and most domains lack clear/frequent reward.' McLaughlin argues that most domains that humans actually care about are not easily verifiable, and we will therefore have little success using RL to make AI superhuman at them: for instance, giving career advice, managing a team, understanding social trends, writing original poetry, investing in startups.

A few counterarguments to this critique are worth considering. The first centers on the concepts of transfer learning and generalizability. Transfer learning is the idea that models trained in one area can transfer those learnings to improve in other areas. Proponents of transfer learning in RL argue that, even if reasoning models are trained only on math and coding problems, this will endow them with broad-based reasoning skills that will generalize beyond those domains and enhance their ability to tackle all sorts of cognitive tasks.

'Learning to think in a structured way, breaking topics down into smaller subtopics, understanding cause and effect, tracing the connections between different ideas—these skills should be broadly helpful across problem spaces,' said Dhruv Batra, cofounder/chief scientist at Yutori and former senior AI researcher at Meta. 'This is not so different from how we approach education for humans: we teach kids basic numeracy and literacy in the hopes of creating a generally well-informed and well-reasoning population.'

Put more strongly: if you can solve math, you can solve anything. Anything that can be done with a computer, after all, ultimately boils down to math. It is an intriguing hypothesis. But to date, there is no conclusive evidence that RL endows LLMs with reasoning capabilities that generalize beyond easily verifiable domains like math and coding. It is no coincidence that the most important advances in AI in recent months—both from a research and a commercial perspective—have occurred in precisely these two fields.
If RL can only give AI models superhuman powers in domains that can be easily verified, this represents a serious limit to how far RL can advance the frontiers of AI's capabilities. AI systems that can write code or do mathematics as well as or better than humans are undoubtedly valuable. But true general-purpose intelligence consists of much more than this.

Let us consider another counterpoint on this topic, though: what if verification systems can in fact be built for many (or even all) domains, even when those domains are not as clearly deterministic and checkable as a math problem? Might it be possible to develop a verification system that can reliably determine whether a novel, or a government policy, or a piece of career advice, is 'good' or 'successful' and therefore should be positively reinforced?

This line of thinking quickly leads us into borderline philosophical considerations. In many fields, determining the 'goodness' or 'badness' of a given outcome would seem to involve value judgments that are irreducibly subjective, whether on ethical or aesthetic grounds. For instance, is it possible to determine that one public policy outcome (say, reducing the federal deficit) is objectively superior to another (say, expanding a certain social welfare program)? Is it possible to objectively identify that a painting or a poem is or is not 'good'? What makes art 'good'? Is beauty not, after all, in the eye of the beholder? Certain domains simply do not possess a 'ground truth' to learn from, but rather only differing values and tradeoffs to be weighed.

Even in such domains, though, another possible approach exists. What if we could train an AI via many examples to instinctively identify 'good' and 'bad' outcomes, even if we can't formally define them, and then have that AI serve as our verifier? As Julien Launay, CEO/cofounder of RL startup Adaptive ML, put it: 'In bridging the gap from verifiable to non-verifiable domains, we are essentially looking for a compiler for natural language…but we already have built this compiler: that's what large language models are.' This approach is often referred to as reinforcement learning from AI feedback (RLAIF) or 'LLM-as-a-Judge.' Some researchers believe it is the key to making verification possible across more domains.

But it is not clear how far LLM-as-a-Judge can take us. The reason that reinforcement learning from verifiable rewards has led to such incisive reasoning capabilities in LLMs in the first place is that it relies on formal verification methods: correct and incorrect answers exist to be discovered and learned. LLM-as-a-Judge seems to bring us back to a regime more closely resembling RLHF, whereby AI models can be fine-tuned to internalize whatever preferences and value judgments are contained in the training data, arbitrary though they may be. This merely punts the problem of verifying subjective domains to the training data, where it may remain as unsolvable as ever.

We can say this much for sure: to date, neither OpenAI nor Anthropic nor any other frontier lab has debuted an RL-based system with superhuman capabilities in writing novels, or advising governments, or starting companies, or any other activity that lacks obvious verifiability. This doesn't mean that the frontier labs are not making progress on the problem. Indeed, just last month, leading OpenAI researcher Noam Brown shared on X: 'We developed new techniques that make LLMs a lot better at hard-to-verify tasks.'
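For a sense of what LLM-as-a-Judge could look like in practice, here is a minimal, hypothetical sketch of turning a judge model's rating into a reward signal for a hard-to-verify task. The `judge_model` callable and the prompt wording are illustrative assumptions, not a real API.

```python
JUDGE_PROMPT = """You are a strict evaluator. Rate the following career advice for
usefulness and safety on a scale from 1 (poor) to 10 (excellent).
Respond with a single integer.

Advice:
{response}
"""

def llm_judge_reward(response: str, judge_model) -> float:
    """Reward from an AI judge instead of a formal verifier (RLAIF / LLM-as-a-Judge).

    `judge_model` is assumed to be any callable mapping a prompt string to a
    completion string; in practice it would be a call to a strong LLM.
    """
    raw = judge_model(JUDGE_PROMPT.format(response=response))
    try:
        score = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # an unusable judgment contributes no learning signal
    # Rescale 1..10 to roughly -1..+1 so it behaves like a verifiable reward.
    return (score - 5.5) / 4.5

# Toy usage with a stand-in judge that always answers "7".
print(llm_judge_reward("Ask a mentor to review your plan before quitting.", lambda p: "7"))
```

The judge's ratings are, of course, only as trustworthy as the judge itself, which is the concern raised above about punting subjectivity into the training data.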
Rumors have even begun to circulate that OpenAI has developed a so-called 'universal verifier,' which can provide an accurate reward signal in any domain. It is hard to imagine how such a universal verifier would work; no concrete details have been shared publicly. Time will tell how powerful these new techniques prove to be.

It is important to remember that we are still in the earliest innings of the reinforcement learning era in generative AI. We have just begun to scale RL. The total amount of compute and training data devoted to reinforcement learning remains modest compared to the level of resources spent on pretraining foundation models. A chart from a recent OpenAI presentation speaks volumes: it depicts RL transitioning from a relatively minor component of AI training budgets to the main focus. At this very moment, AI organizations are preparing to deploy vast sums to scale up their reinforcement learning efforts as quickly as they can.

What does it mean to scale RL? 'Perhaps the most important ingredient when scaling RL is the environments—in other words, the settings in which you unleash the AI to explore and learn,' said Stanford AI researcher Andy Zhang. 'In addition to sheer quantity of environments, we need higher-quality environments, especially as model capabilities improve. This will require thoughtful design and implementation of environments to ensure diversity and goldilocks difficulty and to avoid reward hacking and broken tasks.' When xAI debuted its new frontier model Grok 4 last month, it announced that it had devoted 'over an order of magnitude more compute' to reinforcement learning than it had with previous models. We have many more orders of magnitude to go.

Today's RL-powered models, while powerful, face shortcomings. The unsolved challenge of difficult-to-verify domains, discussed above, is one. Another critique is known as elicitation: the hypothesis that reinforcement learning doesn't actually endow AI models with greater intelligence but rather just elicits capabilities that the base model already possessed. Yet another obstacle that RL faces is its inherent sample inefficiency compared to other AI paradigms: RL agents must do a tremendous amount of work to receive a single bit of feedback. This 'reward sparsity' has made RL impracticable to deploy in many contexts.

It is possible that scale will be a tidal wave that washes all of these concerns away. If there is one principle that has defined frontier AI in recent years, after all, it is this: nothing matters more than scale. When OpenAI scaled from GPT-2 to GPT-3 to GPT-4 between 2019 and 2023, the models' performance gains and emergent capabilities were astonishing, far exceeding the community's expectations. At every step, skeptics identified shortcomings and failure modes with these models, claiming that they revealed fundamental weaknesses in the technology paradigm and predicting that progress would soon hit a wall. Instead, the next generation of models would blow past these shortcomings, advancing the frontier by leaps and bounds and demonstrating new capabilities that critics had previously argued were impossible.

The world's leading AI players are betting that a similar pattern will play out with reinforcement learning. If recent history is any guide, it is a good bet to make. But it is important to remember that AI 'scaling laws'—which predict that AI performance increases as data, compute and model size increase—are not actually laws in any sense of that word.
They are empirical observations that for a time proved reliable and predictive for pretraining language models and that have been preliminarily demonstrated in other data modalities. There is no formal guarantee that scaling laws will always hold in AI, nor how long they will last, nor how steep their slope will be. The truth is that no one knows for sure what will happen when we massively scale up RL. But we are all about to find out. Stay tuned for our follow-up article on this topic—or feel free to reach out directly to discuss!

Looking Forward

Reinforcement learning represents a compelling approach to building machine intelligence for one profound reason: it is not bound by human competence or imagination. Training an AI model on vast troves of labeled data (supervised learning) will make the model exceptional at understanding those labels, but its knowledge will be limited to the annotated data that humans have prepared. Training an AI model on the entire internet (self-supervised learning) will make the model exceptional at understanding the totality of humanity's existing knowledge, but it is not clear that this will enable it to generate novel insights that go beyond what humans have already put forth. Reinforcement learning faces no such ceiling. It does not take its cues from existing human data. An RL agent learns for itself, from first principles, through first-hand experience.

AlphaGo's 'Move 37' serves as the archetypal example here. In one of its matches against human world champion Lee Sedol, AlphaGo played a move that violated thousands of years of accumulated human wisdom about Go strategy. Most observers assumed it was a miscue. Instead, Move 37 proved to be a brilliant play that gave AlphaGo a decisive advantage over Sedol. The move taught humanity something new about the game of Go. It has forever changed the way that human experts play the game.

The ultimate promise of artificial intelligence is not simply to replicate human intelligence. Rather, it is to unlock new forms of intelligence that are radically different from our own—forms of intelligence that can come up with ideas that we never would have come up with, make discoveries that we never would have made, help us see the world in previously unimaginable ways. We have yet to see a 'Move 37 moment' in the world of generative AI. It may be a matter of weeks or months—or it may never happen. Watch this space.
