At Secret Math Meeting, Researchers Struggle to Outsmart AI

On a weekend in mid-May, a clandestine mathematical conclave convened. Thirty of the world's most renowned mathematicians traveled to Berkeley, Calif., with some coming from as far away as the U.K. The group's members faced off in a showdown with a 'reasoning' chatbot that was tasked with solving problems they had devised to test its mathematical mettle. After throwing professor-level questions at the bot for two days, the researchers were stunned to discover it was capable of answering some of the world's hardest solvable problems. 'I have colleagues who literally said these models are approaching mathematical genius,' says Ken Ono, a mathematician at the University of Virginia, who attended the meeting.
The chatbot in question is powered by o4-mini, a so-called reasoning large language model (LLM). It was trained by OpenAI to be capable of making highly intricate deductions. Google's equivalent, Gemini 2.5 Flash, has similar abilities. Like the LLMs that powered earlier versions of ChatGPT, o4-mini learns to predict the next word in a sequence. Compared with those earlier LLMs, however, o4-mini and its equivalents are lighter-weight, more nimble models that train on specialized datasets with stronger reinforcement from humans. The approach leads to a chatbot capable of diving much deeper into complex problems in math than traditional LLMs.
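To make 'predicting the next word in a sequence' concrete, here is a toy sketch in TypeScript that counts which word most often follows each word in a tiny corpus and predicts accordingly; it illustrates only the statistical idea behind the training objective, not how OpenAI or Google actually build or train their models.

```typescript
// Toy illustration of next-word prediction: count which word most often
// follows each word in a tiny corpus, then predict by picking the most
// frequent follower. Real LLMs learn far richer statistics with neural
// networks over vast corpora; this only sketches the core objective.
const corpus = "the cat sat on the mat and the cat ate the fish".split(" ");

const followers = new Map<string, Map<string, number>>();
for (let i = 0; i < corpus.length - 1; i++) {
  const current = corpus[i];
  const next = corpus[i + 1];
  const counts = followers.get(current) ?? new Map<string, number>();
  counts.set(next, (counts.get(next) ?? 0) + 1);
  followers.set(current, counts);
}

function predictNext(word: string): string | undefined {
  const counts = followers.get(word);
  if (!counts) return undefined;
  // Return the follower seen most often after this word.
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

console.log(predictNext("the")); // "cat", the most common word after "the"
```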
To track the progress of o4-mini, OpenAI previously tasked Epoch AI, a nonprofit that benchmarks LLMs, with coming up with 300 math questions whose solutions had not yet been published. Even traditional LLMs can correctly answer many complicated math questions. Yet when Epoch AI asked several such models these questions, which they hadn't previously been trained on, the most successful were able to solve less than 2 percent, showing that these LLMs lacked the ability to reason. But o4-mini would prove to be very different.
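Epoch AI has not published its evaluation harness, but the basic logic of scoring models on a held-out benchmark can be sketched as follows; the data shape and the askModel callback are illustrative assumptions, not the organization's actual code.

```typescript
// Hypothetical sketch of how a held-out benchmark like FrontierMath could be
// scored: each question keeps an unpublished reference answer, the model's
// final answer is compared against it, and the solve rate is the fraction of
// matches. A solve rate below 0.02 corresponds to "less than 2 percent."
interface BenchmarkItem {
  question: string;
  referenceAnswer: string; // unpublished, so models cannot have trained on it
}

async function solveRate(
  items: BenchmarkItem[],
  askModel: (question: string) => Promise<string>,
): Promise<number> {
  let solved = 0;
  for (const item of items) {
    const answer = (await askModel(item.question)).trim();
    if (answer === item.referenceAnswer.trim()) solved += 1;
  }
  return solved / items.length;
}

export { solveRate };
```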
Epoch AI hired Elliot Glazer, who had recently finished his math Ph.D., to join the new collaboration for the benchmark, dubbed FrontierMath, in September 2024. The project collected novel questions over varying tiers of difficulty, with the first three tiers covering undergraduate-, graduate- and research-level challenges. By February 2025, Glazer found that o4-mini could solve around 20 percent of the questions. He then moved on to a fourth tier: 100 questions that would be challenging even for an academic mathematician. Only a small group of people in the world would be capable of developing such questions, let alone answering them. The mathematicians who participated had to sign a nondisclosure agreement agreeing to communicate solely via the messaging app Signal. Other forms of contact, such as traditional e-mail, could potentially be scanned by an LLM and inadvertently train it, thereby contaminating the dataset.
The group made slow, steady progress in finding questions. But Glazer wanted to speed things up, so Epoch AI hosted the in-person meeting on Saturday, May 17, and Sunday, May 18. There, the participants would find the final 10 challenge questions. The meeting was headed by Ono, who split the 30 attendees into groups of six. For two days, the academics competed among themselves to devise problems that they could solve but that would trip up the AI reasoning bot. Any problem that o4-mini couldn't solve would earn the mathematician who came up with it a $7,500 reward.
By the end of that Saturday night, Ono was frustrated with the team's lack of progress. 'I came up with a problem which everyone in my field knows to be an open question in number theory—a good Ph.D.-level problem,' he says. He asked o4-mini to solve the question. Over the next 10 minutes, Ono watched in stunned silence as the bot unfurled a solution in real time, showing its reasoning process along the way. The bot spent the first two minutes finding and mastering the related literature in the field. Then it wrote on the screen that it wanted to try solving a simpler 'toy' version of the question first in order to learn. A few minutes later, it wrote that it was finally prepared to solve the more difficult problem. Five minutes after that, o4-mini presented a correct but sassy solution. 'It was starting to get really cheeky,' says Ono, who is also a freelance mathematical consultant for Epoch AI. 'And at the end, it says, 'No citation necessary because the mystery number was computed by me!''
Defeated, Ono jumped onto Signal that night and alerted the rest of the participants. 'I was not prepared to be contending with an LLM like this,' he says. 'I've never seen that kind of reasoning before in models. That's what a scientist does. That's frightening.'
Although the group did eventually succeed in finding 10 questions that stymied the bot, the researchers were astonished by how far AI had progressed in the span of one year. Ono likened it to working with a 'strong collaborator.' Yang-Hui He, a mathematician at the London Institute for Mathematical Sciences and an early pioneer of using AI in math, says, 'This is what a very, very good graduate student would be doing—in fact, more.'
The bot was also much faster than a professional mathematician, taking mere minutes to complete what would take such a human expert weeks or months.
While sparring with o4-mini was thrilling, its progress was also alarming. Ono and He express concern that o4-mini's results might be trusted too much. 'There's proof by induction, proof by contradiction, and then proof by intimidation,' He says. 'If you say something with enough authority, people just get scared. I think o4-mini has mastered proof by intimidation; it says everything with so much confidence.'
By the end of the meeting, the group started to consider what the future might look like for mathematicians. Discussions turned to the inevitable 'tier five'—questions that even the best mathematicians couldn't solve. If AI reaches that level, the role of mathematicians would undergo a sharp change. For instance, mathematicians may shift to simply posing questions and interacting with reasoning bots to help them discover new mathematical truths, much the same as a professor does with graduate students. As such, Ono predicts that nurturing creativity in higher education will be key to keeping mathematics going for future generations.
'I've been telling my colleagues that it's a grave mistake to say that generalized artificial intelligence will never come, [that] it's just a computer,' Ono says. 'I don't want to add to the hysteria, but these large language models are already outperforming most of our best graduate students in the world.'


Related Articles

Vibe coding lets anyone write software—but comes with risks

Fast Company

Whether you're streaming a show, paying bills online or sending an email, each of these actions relies on computer programs that run behind the scenes. The process of writing computer programs is known as coding. Until recently, most computer code was written, at least originally, by human beings. But with the advent of generative artificial intelligence, that has begun to change. Just as you can ask ChatGPT to spin up a recipe for a favorite dish or write a sonnet in the style of Lord Byron, now you can ask generative AI tools to write computer code for you. Andrej Karpathy, an OpenAI co-founder who previously led AI efforts at Tesla, recently termed this 'vibe coding.'

For complete beginners or nontechnical dreamers, writing code based on vibes—feelings rather than explicitly defined information—could feel like a superpower. You don't need to master programming languages or complex data structures. A simple natural language prompt will do the trick.

How it works

Vibe coding leans on standard patterns of technical language, which AI systems use to piece together original code from their training data. Any beginner can use an AI assistant such as GitHub Copilot or Cursor Chat, put in a few prompts, and let the system get to work. Here's an example: 'Create a lively and interactive visual experience that reacts to music, user interaction, or real-time data. Your animation should include smooth transitions and colorful and lively visuals with an engaging flow in the experience. The animation should feel organic and responsive to the music, user interaction, or live data and facilitate an experience that is immersive and captivating. Complete this project using JavaScript or React, and allow for easy customization to set the mood for other experiences.'

But AI tools do this without any real grasp of specific rules, edge cases, or security requirements for the software in question. This is a far cry from the processes behind developing production-grade software, which must balance trade-offs between product requirements, speed, scalability, sustainability, and security. Skilled engineers write and review the code, run tests, and establish safety barriers before going live. But while the lack of a structured process saves time and lowers the skills required to code, there are trade-offs. With vibe coding, most of these stress-testing practices go out the window, leaving systems vulnerable to malicious attacks and leaks of personal data. And there's no easy fix: If you don't understand every—or any—line of code that your AI agent writes, you can't repair the code when it breaks. Or worse, as some experts have pointed out, you won't notice when it's silently failing. The AI itself is not equipped to carry out this analysis either. It recognizes what 'working' code usually looks like, but it cannot necessarily diagnose or fix deeper problems that the code might cause or exacerbate.

Why it matters

Vibe coding could be just a flash-in-the-pan phenomenon that will fizzle before long, but it may also find deeper applications with seasoned programmers. The practice could help skilled software engineers and developers more quickly turn an idea into a viable prototype. It could also enable novice programmers or even amateur coders to experience the power of AI, perhaps motivating them to pursue the discipline more deeply. Vibe coding also may signal a shift that could make natural language a more viable tool for developing some computer programs.
If so, it would echo early website editing systems known as WYSIWYG editors that promised designers 'what you see is what you get,' or 'drag-and-drop' website builders that made it easy for anyone with basic computer skills to launch a blog. For now, I don't believe that vibe coding will replace experienced software engineers, developers, or computer scientists. The discipline and the art are much more nuanced than what AI can handle, and the risks of passing off 'vibe code' as legitimate software are too great. But as AI models improve and become more adept at incorporating context and accounting for risk, practices like vibe coding might cause the boundary between AI and human programmer to blur further.
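As a rough illustration of the 'How it works' example above, here is a minimal sketch of the kind of code such a prompt might yield, written in plain browser TypeScript rather than React; it is not the output of any particular assistant, and it omits exactly the tests, error handling and security review the article says production software needs.

```typescript
// Sketch of a "vibe-coded" result: a canvas animation with smooth trails and
// cycling colors that reacts to mouse movement. Illustrative only.
const canvas = document.createElement("canvas");
canvas.width = 640;
canvas.height = 360;
document.body.appendChild(canvas);
const ctx = canvas.getContext("2d") as CanvasRenderingContext2D;

let hue = 0;
let pointerX = canvas.width / 2;
canvas.addEventListener("mousemove", (event) => {
  pointerX = event.offsetX; // the visual reacts to user interaction
});

function draw(): void {
  hue = (hue + 1) % 360;
  // Fade previous frames slightly to create smooth trails.
  ctx.fillStyle = "rgba(0, 0, 0, 0.08)";
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  // Draw a pulsing, color-cycling circle that follows the pointer.
  ctx.fillStyle = `hsl(${hue}, 80%, 60%)`;
  ctx.beginPath();
  ctx.arc(pointerX, canvas.height / 2, 30 + 20 * Math.sin(hue / 20), 0, 2 * Math.PI);
  ctx.fill();
  requestAnimationFrame(draw);
}
draw();
```

A snippet like this will run and look plausible, which is the article's point: nothing in it validates input, handles failures or considers security, and a non-programmer would have no way to tell.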

How To Build Your Own Generative AI Toolkit To Stay Ahead At Work

Forbes

If you are waiting for your company to adopt AI, you are missing out on a personal opportunity to get ahead in your career. With generative AI you can build your own workflow, automate routine tasks, or create more compelling content. It doesn't take a technical background, but it helps if you have curiosity. The first question you should ask is: what is generative AI and what is a generative toolkit? Generative AI is a fancy term for artificial intelligence that creates new things by learning from existing data and using patterns to generate something original. A generative AI toolkit is a collection of tools you can use to create new things like text, images, videos, or audio using AI.

What Should Be In A Generative AI Toolkit To Help You Work Smarter?

I use many tools that help me get things done more efficiently, and I encourage others to do the same. Don't let the sound of the following AI-related terms intimidate you. Once you start using them, they're easier than you might expect. Here are some of the tasks where I've found generative AI most useful, along with the tools that can help you do each one more effectively so you can be more productive and make a stronger impression.

• Brainstorming: Use ChatGPT to generate ideas, outlines, and scripts for any content format.
• Audio Creation: ElevenLabs can replicate your voice to narrate scripts. If you record your own voice or anything else, Adobe Podcast Enhance cleans up the audio and makes it studio-quality.
• Visual Creation: Kling creates video clips quickly and inexpensively. Canva helps you design everything from presentations to videos to graphics. If you have a lot of expertise, you can use Photoshop as well.
• Video Editing: Use Camtasia for editing and Submagic to add b-roll (video clips), captions, and supporting visuals.

These platforms don't require a production team. Many are low-cost subscriptions or pay-per-use, making it easy to experiment and find your best-fit tools.

How Can A Generative AI Toolkit Make You A More Effective Communicator?

If you present ideas to clients, teams, or students, generative AI tools can help you create stronger, more polished content. I use them to work faster and communicate more clearly. Captions, visuals, and sound quality all contribute to how a message is received. Captions are essential, especially on mobile devices or in quiet settings. Submagic handles that seamlessly and can add visual enhancements to keep viewers engaged. Using ElevenLabs to adjust tone and pacing also improves how your message lands. These tools allow you to focus on the substance of your message while still producing something visually and audibly appealing.

How Affordable Is It To Build Your Own Generative AI Toolkit?

You don't need to invest thousands. Most of the tools I've mentioned are affordable and flexible. Some charge monthly. Others charge per project. Camtasia and Canva are widely used and offer significant value. Many people underestimate what Canva can do until they explore it. When I wanted to learn more, I took a short course for under one hundred dollars. I have no affiliation with them (or any of the other tools I mention here), but the course was far more useful than a recent graduate-level university certification I completed from one of the top technology schools. That program cost thousands and didn't include real-world applications or hands-on training.

How Does Curiosity Help You Get The Most From Your AI Toolkit?

Learning how to use AI tools starts with curiosity. You don't need to understand every feature. You just need to be open to trying something new. People often wait until they feel completely prepared. That delay is what slows progress. I recently attended an event hosted by HRNxt, where we discussed how hard it can be to adopt new technologies. Jessica Hanan, Head of Workforce Enablement at Altruistic, told a story that captured the problem well. When cars were first introduced, some had fake horse heads attached to the front to make passengers feel more comfortable. We're in a similar place now with AI. People need help getting past their initial discomfort.

One simple way to make adoption easier is to divide learning across a team. Assign one person to experiment with ChatGPT for scripting. Another can test ElevenLabs for voiceover. A third can use Adobe Podcast Enhance for audio quality. Someone else can explore a tool for visuals. Make the group goal a final video project. That structure gives everyone a role and makes learning more purposeful.

How Do You Know When Your Generative AI Toolkit Is Working?

You'll know it's working when your process feels smoother. Maybe you spend less time on repetitive tasks or feel more confident creating something that used to take hours. You don't need dozens of tools. Just a few that work well for you. Once people get started, they tend to personalize their stack. One person might use their toolkit for presentations. Another may use it to create educational materials or social content. The point is to start building your own system that supports your work.

What's The Best Way To Get Started With A Generative AI Toolkit?

Start with one real task at work that takes too long or could be better. Choose one tool to improve that task. If you need clearer audio, try Adobe Podcast Enhance. If you want help writing, test ChatGPT. If you need short videos, explore a video tool. Document what works and refine from there. This kind of simple experimentation builds your skills quickly without being overwhelming.

Why Should You Build A Generative AI Toolkit Now?

You don't have to wait for your company to catch up. The best time to start using AI is when you still have the space to experiment without pressure. The people gaining the most value from creating a generative AI toolkit are professionals who stay curious, take small steps, and learn by doing. This is your chance to get ahead while others hesitate. Pick one tool and share what you learn. The sooner you start, the more confident and capable you'll be when these tools become a standard part of work.

I ignored this ChatGPT setting for months — now I use it every day

Tom's Guide

Like millions of others, I use ChatGPT daily. As a power user, I frequently use ChatGPT to summarize research, create images and I even use the bot to talk me down from a spiral. Recently, I revisited an underused setting buried in the app that completely changes how the bot responds. It's so useful that I really wish I had taken advantage of it sooner.

No, it's not a secret plugin or a pro-only feature. It's something that's been there the whole time: custom instructions. You've probably seen the button dozens of times but never bothered to click on it. It lives quietly in your settings menu under the heading 'Customize ChatGPT.' This customizes the chatbot's behavior.

Using customized GPTs dramatically improved how I interact with the AI by tailoring the chatbot to my specific needs. Think of it as crafting perfect assistants for certain tasks. Instead of re-explaining your preferences every time, a custom GPT can be set up to understand your job, writing style, tone and even the kind of responses you want (short, long, complex, simple). Custom GPTs also support powerful tools and integrations. You can grant them access to a code interpreter, web browsing, image generation or even custom APIs and uploaded files. Using this feature turns ChatGPT into an even better assistant capable of analyzing data, generating visuals or referencing your documents without extra work on your part. Plus, you can set behavioral instructions so the GPTs always respond in your preferred tone or format, saving you time and improving consistency.

For me, the hardest part about using a customized GPT is literally remembering to use it. Although it's just a click away, sometimes I'll dive into a prompt before I remember there's a better way to get the best results. Once you've built a custom GPT that works well, you can reuse it as often as you like or share it with others (a great asset for teams). Whether you're managing SEO, writing emails or brainstorming ideas, having a GPT fine-tuned to your process means faster, smarter output. It's a useful way to turn a general-purpose tool into a personal or professional super-assistant.

If you're like me, you'll notice that when you finally start customizing GPTs, your experience with AI will shift entirely, and for the better. Your responses will feel clearer and far more personal. You won't get generic responses, but answers that fit your style and suggestions that are more engaging.

You can access this feature in just a few taps:

• Click your name (or the three dots in the bottom left)
• Tap Settings
• Select Custom Instructions

You'll see two key fields:

• 'What would you like ChatGPT to know about you to provide better responses?' (Example: 'I'm a busy mom of three and want an empathetic, conversational tone that feels like I'm chatting with a friend.')
• 'How would you like ChatGPT to respond?' (Example: 'Use short paragraphs, avoid buzzwords, and give practical suggestions. Add a human tone, like you're texting a smart coworker.')

Once you fill these out, that context is baked into every conversation. You don't have to reintroduce yourself or explain your tone again. ChatGPT just gets it. Custom instructions are convenient because they are like having a trick in your back pocket.

Whatever issue you were having with ChatGPT earlier, such as answers feeling too generic or formal, this setting fixes it. It turns the chatbot into something much closer to a real assistant, one that actually understands everything about you (well, as much as you feel comfortable telling it). It also means you'll spend less time rewriting responses and more time getting useful results. For example:

• When I asked it to write a note to the babysitter, it used formatting and tone I'd normally have to adjust.
• When I needed a list of birthday party locations in the area, it knew where I lived and pulled them up immediately. (This type of ultra personalization might not be for everyone, but I find it to be a time saver.)
• And when I asked for snack ideas for the soccer team, the list actually sounded like something I'd submit (and easy enough for a busy mom to contribute), not something from a generic listicle generator.

Best of all? It feels more personal without sacrificing quality responses. If you're using ChatGPT with memory enabled, custom instructions are the perfect complement. Memory helps the chatbot remember ongoing preferences and facts across conversations, while custom instructions give it a solid starting point for every new chat. Even if memory isn't your thing, these static instructions make ChatGPT far more efficient right out of the gate.

This one setting changes how well ChatGPT can work for you. If you've been using the chatbot like a search engine or idea machine, custom instructions push it into a new category, making it more like a personal AI assistant. You'll notice a difference when it starts answering like someone who knows your voice, your goals and how you think. Do you use custom GPTs? Let me know in the comments!
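For anyone who works with the API rather than the ChatGPT app, the effect of custom instructions can be approximated by sending a standing system message with every request. The sketch below uses OpenAI's official Node SDK; the instruction text adapts the article's example, and the model name is an assumption chosen for illustration.

```typescript
// Approximating ChatGPT's custom instructions with a persistent system
// message. The instruction text and model choice are illustrative.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const customInstructions =
  "I'm a busy mom of three. Use short paragraphs, avoid buzzwords, and give " +
  "practical suggestions in a friendly, conversational tone.";

async function ask(prompt: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: customInstructions }, // standing context
      { role: "user", content: prompt },
    ],
  });
  return completion.choices[0].message.content ?? "";
}

ask("Write a short note to the babysitter about pickup time.").then(console.log);
```

Because the system message rides along with every call, the assistant keeps the same voice across requests, which is roughly what the in-app setting does for every new chat.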
