AI Took on the Math Olympiad—But Mathematicians Aren't Impressed

A defining memory from my senior year of high school was a nine-hour math exam with just six questions. Six of the top scorers won slots on the U.S. team for the International Math Olympiad (IMO), the world's longest-running math competition for high school students. I didn't make the cut but became a tenured mathematics professor anyway.
This year's olympiad, held last month on Australia's Sunshine Coast, had an unusual sideshow. While students from 110 countries went to work on complex math problems using pen and paper, several AI companies quietly tested new models in development on a computerized approximation of the exam. Right after the closing ceremonies, OpenAI and later Google DeepMind announced that their models had earned (unofficial) gold medals for solving five of the six problems. Researchers like Sébastien Bubeck of OpenAI celebrated these models' successes as a 'moon landing moment' by industry.
But are they? Is AI going to replace professional mathematicians? I'm still waiting for the proof.
The hype around this year's AI results is easy to understand because the olympiad is hard. To wit, in my senior year of high school, I set aside calculus and linear algebra to focus on olympiad-style problems, which were more of a challenge. Plus, the cutting-edge models still in development did so much better at the exam than the commercial models already out there. In a parallel contest administered by MathArena.ai, Gemini 2.5 Pro, Grok 4, o3 high, o4-mini high and DeepSeek R1 all failed to produce a single completely correct solution. That gap shows that AI models are getting smarter, with their reasoning capabilities improving rather dramatically.
Yet I'm still not worried.
The latest models just got a good grade on a single test (as did many of the students), and a head-to-head comparison isn't entirely fair. The models often employ a 'best-of-n' strategy, generating multiple solutions and then grading themselves to select the strongest. This is akin to having several students work independently, then get together to pick the best solution and submit only that one. If the human contestants were allowed this option, their scores would likely improve too.
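To make the best-of-n idea concrete, here is a minimal Python sketch. The function `best_of_n` and the toy solver and grader are illustrative assumptions only; they do not describe how any company's system actually works, and a real self-grader would not have access to the true answer the way this toy one does.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int) -> T:
    """Draw n independent candidates and return the one the scorer rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[T] = [generate() for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins (hypothetical): a "solver" that guesses an integer answer and
# a "self-grader" that prefers guesses close to 42. Peeking at the true answer
# is a simplification made purely so the example is easy to follow.
rng = random.Random(0)  # fixed seed so repeated runs behave the same way


def toy_generate() -> int:
    return rng.randint(0, 100)


def toy_score(guess: int) -> float:
    return -abs(guess - 42)


single_attempt = best_of_n(toy_generate, toy_score, n=1)   # one try, submitted as-is
best_of_many = best_of_n(toy_generate, toy_score, n=32)    # 32 tries, only the best submitted
print(single_attempt, best_of_many)
```

Only the selected candidate is ever submitted, which is why a best-of-n score is not directly comparable to a single contestant's single attempt.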
Other mathematicians are similarly cautioning against the hype. IMO gold medalist Terence Tao (currently a mathematician at the University of California, Los Angeles) noted on Mastodon that what AI can do depends on the testing methodology. IMO president Gregor Dolinar said that the organization 'cannot validate the methods [used by the AI models], including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced.'
Besides, IMO exam questions don't compare to the kinds of questions professional mathematicians try to answer, where it can take nine years, rather than nine hours, to solve a problem at the frontier of mathematical research. As Kevin Buzzard, a mathematics professor at Imperial College London, said in an online forum, 'When I arrived in Cambridge UK as an undergraduate clutching my IMO gold medal I was in no position to help any of the research mathematicians there.'
These days, acquiring the expertise needed for mathematical research can take more than one lifetime. Like many of my colleagues, I've been tempted to try 'vibe proving': having a math chat with an LLM as one would with a colleague, asking 'Is it true that...' followed by a technical mathematical conjecture. The chatbot often then supplies a clearly articulated argument that, in my experience, tends to be correct when it comes to standard topics but subtly wrong at the cutting edge. For example, every model I've asked has made the same subtle mistake in assuming that the theory of idempotents behaves the same for weak infinite-dimensional categories as it does for ordinary ones, something that human experts in my field (trust me on this) know to be false.
I'll never trust an LLM—which at its core is just predicting what text will come next in a string of words, based on what's in its dataset—to provide a mathematical proof that I can't verify myself.
The good news is, we do have an automated mechanism for determining whether proofs can be trusted. Relatively recent tools called 'proof assistants' are software programs (they don't use AI) designed to check whether a logical argument proves the stated claim. They are increasingly attracting attention from mathematicians like Tao, Buzzard and me who want more assurance that our own proofs are correct. And they offer the potential to help democratize mathematics and even improve AI safety.
Suppose I received a letter, in unfamiliar handwriting, from Erode, a city in Tamil Nadu, India, purporting to contain a mathematical proof. Maybe its ideas are brilliant, or maybe they're nonsensical. I'd have to spend hours carefully studying every line, making sure the argument flowed step-by-step, before I'd be able to determine whether the conclusions are true or false.
But if the mathematical text were written in an appropriate computer syntax instead of natural language, a proof assistant could check the logic for me. A human mathematician, such as I, would then only need to understand the meaning of the technical terms in the theorem statement. In the case of Srinivasa Ramanujan, a generational mathematical genius who did hail from Erode, an expert did take the time to carefully decipher his letter. In 1913 Ramanujan wrote to the British mathematician G. H. Hardy with his ideas. Luckily, Hardy recognized Ramanujan's brilliance and invited him to Cambridge to collaborate, launching the career of one of the all-time mathematical 'greats.'
What's interesting is that some of the AI IMO contestants submitted their answers in the language of the Lean computer proof assistant so that the computer program could automatically check for errors in their reasoning. A start-up called Harmonic posted formal proofs generated by their model for five of the six problems, and ByteDance achieved a silver-medal level performance by solving four of the six problems. But the questions had to be written to accommodate the models' language limitations, and they still needed days to figure it out.
Still, formal proofs are uniquely trustworthy. While so-called 'reasoning' models are prompted to break problems down into pieces and explain their 'thinking' step by step, their output is as likely to be an argument that sounds logical but isn't as it is to be a genuine proof. By contrast, a proof assistant will not accept a proof unless it is fully precise and fully rigorous, with every step in the chain of reasoning justified. In some circumstances, a hand-waving or approximate solution is good enough, but when mathematical accuracy matters, we should demand that AI-generated proofs be formally verifiable.
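As a small illustration of what a machine-checkable proof looks like, here is a sketch in Lean 4, the proof assistant named earlier. The theorem name `add_comm_example` is made up for this example; `Nat.add_comm` is a lemma from Lean's core library. These are deliberately trivial statements, nowhere near olympiad difficulty.

```lean
-- Accepted: the statement is justified by citing an existing lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Accepted: both sides compute to the same number, so `rfl` suffices.
example : 2 + 2 = 4 := rfl

-- Rejected if uncommented: the two sides do not compute to the same number,
-- so Lean reports an error instead of accepting the "proof".
-- example : 2 + 2 = 5 := rfl
```

The point is that acceptance is all-or-nothing: a reader only needs to understand the statement after the colon, because the checker has already verified everything after `:=`.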
Not every application of generative AI is so black and white, where humans with the right expertise can determine whether the results are correct or incorrect. In life, there is a lot of uncertainty and it's easy to make mistakes. As I learned in high school, one of the best things about math is the fact that you can prove definitively that some ideas are wrong. So I'm happy to have an AI try to solve my personal math problems, but only if the results are formally verifiable. And we aren't quite there, yet.

Related Articles

David Sacks' "Goldilocks" scenario

Axios

28 minutes ago

David Sacks, famed tech founder & investor, co-host of the "All-In Podcast" ("The Rainman") and now White House special adviser for AI & crypto, declared Saturday on X that "Doomer narratives were wrong" about AI.
Why it matters: The big takeaway from Sacks' post is that the fear of one AI frontier model dominating all now looks far-fetched, since high-performing competing models are diffusing power.
"The AI race is highly dynamic so this could change," Sacks wrote. "But right now the current situation is Goldilocks":
  • "We have 5 major American companies vigorously competing on frontier models. This brings out the best in everyone and helps America win the AI race."
  • "So far, we have avoided a monopolistic outcome that vests all power and control in a single entity."
  • "There is likely to be a major role for open source. These models excel at providing 80-90% of the capability at 10-20% of the cost. This tradeoff will be highly attractive to customers who value customization, control, and cost over frontier capabilities. China has gone all-in on open source, so it would be good to see more American companies competing in this area, as OpenAI just did. (Meta also deserves credit.)"
  • "There is likely to be a division of labor between generalized foundation models and specific verticalized applications. Instead of a single superintelligence capturing all the value, we are likely to see numerous agentic applications solving 'last mile' problems. This is great news for the startup ecosystem."
  • "There is also an increasingly clear division of labor between humans and AI. Despite all the wondrous progress, AI models are still at zero in terms of setting their own objective function. Models need context, they must be heavily prompted, the output must be verified, and this process must be repeated iteratively to achieve meaningful business value."
Reality check: Though there's plenty of competition, as Sacks notes, the administration has worked closely with OpenAI on the Stargate Project. And as Axios' Scott Rosenberg noted recently, the history of tech shows that early dominant players could still fall to a competitor not yet born: Google came along to dominate what seemed like a well-established search industry in the early dot-com era. The bottom line: The "current state of vigorous competition is healthy," Sacks concluded. "It propels innovation forward, helps America win the AI race, and avoids centralized control. This is good news — that the Doomers did not expect."

Gen AI is coming for online checkout in seismic shift for internet shopping

CNBC

2 hours ago

In the world of online payments, the buzzword these days is frictionless, as companies entice customers with increasingly convenient online payment options. Think the online equivalent of a biometric handwave: the fewer steps involved, the more customers will buy, experts say. With that aim in mind, the growing presence of generative AI such as ChatGPT in the shopping ecosystem may next take a shape that has a seismic impact on the economic model of internet retail. Currently, a search for a large London Fog men's coat on ChatGPT will bring up a purchase option, but you still have to click over to London Fog's site to complete the transaction. But that may soon change. Gen AI search engine Perplexity already has a deal with PayPal for this kind of function, allowing online shoppers to make purchases such as concert tickets and travel directly in chat, with payments completed with PayPal or Venmo, and PayPal handling the processing, shipping, tracking, and invoicing. The Financial Times reported last month that an integrated checkout system is coming to ChatGPT, with partners such as Shopify, so that users can complete the transaction within the platform. Merchants will pay a commission to OpenAI. OpenAI and Shopify have not confirmed the plans. The AI company has already worked with Shopify on an AI assistant for internet sales. OpenAI has already rolled out several features designed to enhance shopping and other consumer experiences, and experts say that in one form or another, this further use of gen AI in retail's future should be expected, and companies need to be planning for the consequences today. "Enabling customers to purchase without leaving the chat will have a significant impact on the sales cycle," said Elizabeth Perkins, professor of practice of business administration and economics at Roanoke College. 
But she added that from a marketing perspective, any time you consolidate or eliminate a part of the purchase cycle, you get your customer that much closer to spending money. "No more time-consuming steps. Customers get what they want faster, with less hassle, and honestly, with less chance of changing their mind," Perkins said. Perkins says that payment interfaces like Venmo or Apple Pay could be disrupted. But Paul McAdam, J.D. Power's senior director of banking intelligence, says he thinks that while AI checkout capabilities will disrupt the checkout ecosystem, big players will find a way to stay in the game. "This is one more competitor looking for a slice of the pie. It will be fascinating to see how banks react to this. There will be a shakeout," but he added, "PayPal, Apple, and Google are pretty entrenched, so I don't think they are going anywhere. This will affect the upstarts the most, some of which will probably get gobbled up by larger competitors." At eBay, AI checkout is not being viewed as a threat, but as a sign of further innovation in the industry. "AI is proving great at speeding up workflows and generating ideas. We believe the true magic lies in blending that speed and scale with trusted, expert, enthusiast communities," said Blair Ethington, vice president of the company's focus categories and buyer experience division. Ethington says that eBay will be investing heavily in an increased AI experience at checkout that delivers real-time, hyper-personalized product picks and guidance tuned to each buyer's shopping preferences. "Ultimately, we believe our scale, unique inventory, and trusted community position us extremely well to thrive in an agentic commerce future," Ethington said. PayPal pointed to its deal with Perplexity as a sign of its embrace of the gen AI feature rather than viewing it as a threat.
"We're partnering with some of the biggest players in this space, like Perplexity, to deliver personalized, secure, and seamless payments and commerce experiences to our network of more than 400 million consumers and merchants worldwide," said a PayPal spokesperson. In one form or another, the checkout experience online is ripe for change, according to Dee Waddell, global managing director of travel & transportation industries at IBM. "Consumers are no longer satisfied with the status quo," said Waddell, citing a recent IBM study that shows only 14% of consumers say they're satisfied with their e-commerce experience. "They're demanding seamless, hyper-personalized experiences across every touchpoint, and retailers must lean into AI to meet those expectations," Waddell said. IBM's retail clients are prioritizing AI to deliver a seamless experience throughout the entire customer journey. "I believe we're on the cusp of a revolution, where generative AI platforms like ChatGPT will act as 'personal assistants' to streamline the digital shopping experience from product discovery to fulfillment," Waddell said. The online shopping experience of the future could feature a consumer going into ChatGPT or Anthropic's Claude to ask for gift ideas. Then, Waddell says, without ever leaving the chat, they'll receive personalized recommendations, confirm their payment method easily, verify their shipping address, and complete the purchase. "In this new model, the AI personal assistant becomes the marketplace. This means retailers will need to solidify their ecosystem and channel strategies," Waddell said, adding that companies will either need to partner with channel providers that want to work with generative AI companies or go directly to AI providers themselves. "This will be a key part of staying relevant and delivering the integrated shopping experiences that consumers are demanding," Waddell said. 
Alex Graf, co-founder and CEO of digital commerce platform Spryker and author of "The E-Commerce Book," says this is the beginning of a seismic shift in how people will shop online. But he says the focus on the threat to payment processors misses the biggest disruptive outcome from the shift. "We're witnessing a structural shift in the e-commerce value chain, and ChatGPT is right at the center of it. The old game was about who could close the sale. The new game is about who controls the pre-sale and gets the user's attention first," Graf said, adding that this is where the LLMs will disrupt the incumbents. Increasingly, Graf says, it will be ChatGPT and Claude that will guide discovery, curate choices, and compress decision-making for customers. "For ecommerce players who've relied on 'watch time' to keep users browsing, like Amazon, Etsy, or even Shopify storefronts, this is an existential threat," Graf said. "Users who previously started their search on e-commerce platforms are now starting with AI. And here's the kicker: selling the product isn't where the margin is anymore. It's selling the eyeballs," Graf said, noting that ads that appear inside ecommerce ecosystems have become a $50 billion market globally. "Amazon's most profitable business isn't Prime or AWS, it's retail media," Graf said. Amazon is investing heavily in its own generative AI features for shopping and has made significant investments in gen AI companies including Anthropic. Graf says that when a gen AI like ChatGPT becomes the "new homepage" of commerce, it doesn't need to sell products directly. "It just needs to own the watch time, then monetize via paid placements, affiliate links, and product recommendations, just like Amazon does internally. This re-routes billions in potential retail media revenue toward whoever owns that conversational layer," Graf said. Amazon's ads business has been a bright spot within earnings in recent quarters.
Graf says fintech players like PayPal or Zelle, the online payment app run by a consortium of major banks, will be impacted, but indirectly. Once ChatGPT, or similar AI agents, corner end-to-end shopping, there's less space for third-party payment tools. "Integrated systems or AI-native wallets may eat that slice of the pie over time," Graf said. The ultimate winners, he says, will be those who can monetize attention and intention, not just transactions.

The next big AI model is here

The Verge

2 hours ago

Hi, friends! Welcome to Installer No. 93, your guide to the best and Verge-iest stuff in the world. (If you're new here, welcome, I'm sad the sun is setting sooner, and also you can read all the old editions at the Installer homepage.) This week, I'm reading about the rise of GEO, laughing at Kirby's new shapes, acknowledging Google's good dunk on Apple, thinking of the best Wordle puzzle I can make, wondering if I'll ever see Microsoft's Windows XP-themed Crocs out in the wild, following the progress of The Bluesky Dictionary, and watching Antoni Porowski's excellent Architectural Digest Open Door episode. I also have for you some AI news from OpenAI, a bug-filled new season of Fortnite, a Site of the Year competition, and more. (As always, the best part of Installer is your ideas and tips. What do you want to know more about? What awesome tricks do you know that everyone else should? What app should everyone be using? Tell me everything: [email protected]. And if you know someone else who might enjoy Installer, forward it to them and tell them to subscribe here.) Today, I'm featuring Siri Ramos, the founder of Mechanism, which makes mounts and grips for handheld gaming devices like the Steam Deck and the Nintendo Switch 2. I unexpectedly met Siri at a recent Triple Click live recording, and I reached out because I thought he might have an interesting homescreen. My hunch was right — I'll let him share more: 'A few months ago I realized how much time I was wasting on my phone and decided to make it less appealing and more functional. Instead of buying a minimal phone like the Light Phone, I decided to convert my iPhone 14 Pro into a dopamine-reduced version to test. I also use a matte screen protector to complete the Light phone look, which is surprisingly cool. 
The phone: iPhone 14 Pro
The wallpaper: A boring grey to mimic the Light Phone
The lockscreen apps:
The main homescreen apps:
The homescreen apps on page two:
The docked apps:
I used a minimal black icon pack mixed with Siri shortcuts, except for the Superhuman email app.
I also asked Siri to share a few things he's into right now. Here's what he said:
Here's what the Installer community is into this week. I want to know what you're into right now as well! Email [email protected] with your recommendations for anything and everything, and we'll feature some of our favorites here every week. For even more great recommendations, check out the replies to this post on The Verge, this post on Threads, and this post on Bluesky.
'For the first time in a few years I'm playing around with some embedded programming platforms, specifically Arduino and the Pi Pico. It's amazing that I can get a computer more powerful than the desktop I owned in high school for less than ten dollars.' - Matthew
'Just got my new CRKD guitar, loving it, and diving back into Clone Hero and YARG, the modern open source versions of Guitar Hero and Rock Band. And great timing, Red Octane is back to make a new rhythm game!' - Bruno
'I'm deeply impressed by the Lord of the Rings audiobook by Andy Serkis. It's insane how differently the characters sound (and how similar to movies).' - Jakub
'This week I took control of my RSS reader by moving from Inoreader to a self-hosted Miniflux instance. It was actually super easy to get working, and it's cool to be able to use different third-party native apps again. It feels like I'm back in the good ol' Google Reader days again.' - gnu_slash_dhruv
'I've been watching Superman: The Animated Series on HBO Max. I love how the series makes the most of its sci-fi premise, alongside compelling cinematography and stirring music. Highly recommend!' - Blue Savoy
'This video from Julian O'Shea about cars getting huge is extremely well done, and the part about pedestrian fatalities is legitimately moving. If you're in need of lighter fare, CityNerd's new video about the Vegas Loop being pathetically stupid is a great follow up piece.' - cowboyxboombap
'I use a mostly-QWERTY layout, but have a ZSA Moonlander keyboard and have customized it to my programmer/game tastes. I can actually touch-type now with this keyboard. It folds up so well that I travel with it all the time, and it gets comments every time I use it.' - Ron
'I'd like to recommend Apocalypse Hotel! It's a charming anime about a group of robots running a hotel long after humanity left the planet. It's got a lovely bittersweet but optimistic mood to it as they wait patiently for humanity to return. And it's quite funny as well.' - Graham
Thanks to everyone who wrote in their stories of alternate keyboard layouts. There's a fair few Dvorak users out there, but I heard from a fellow Colemak typist and even a Norman user! If you're interested in trying an alternate keyboard, I actually recommend the challenge: it's a fun way to re-wire your computer brain. (And maybe give yourself a chance to fix some bad typing habits.)
Also thank you very much to Troy, who, in response to my comment about considering a TV on wheels, shared the setup he uses. 'If you are interested in a rolling TV, what we did was purchase a rolling TV stand from Amazon for $99, a 43-inch TCL Roku TV, and a $20 cover. Works great, we use it all over the house and backyard.' If I had more closet space to 'store' our TV when we weren't using it, I would have ordered a cart and a cover yesterday. Someday, though, I'm sure I'm going to end up trying to live this life.
See you next week!
