AI Sucks at Sudoku. Much More Troubling Is That It Can't Explain Why

CNET, 3 hours ago
Chatbots can be genuinely impressive when you watch them do things they're good at, like writing realistic-sounding text or creating weird futuristic-looking images. But try to ask generative AI to solve one of those puzzles you find in the back of a newspaper, and things can quickly go off the rails.
That's what researchers at the University of Colorado Boulder found when they challenged different large language models to solve Sudoku. And not even the standard 9x9 puzzles. An easier 6x6 puzzle was often beyond the capabilities of an LLM without outside help (in this case, specific puzzle-solving tools).
The more important finding came when the models were asked to show their work. For the most part, they couldn't. Sometimes they lied. Sometimes they explained things in ways that made no sense. Sometimes they hallucinated and started talking about the weather.
If gen AI tools can't explain their decisions accurately or transparently, that should make us cautious as we give these things more and more control over our lives and decisions, said Ashutosh Trivedi, a computer science professor at the University of Colorado Boulder and one of the authors of the paper, which was published in July in the Findings of the Association for Computational Linguistics.
"We would really like those explanations to be transparent and be reflective of why AI made that decision, and not AI trying to manipulate the human by providing an explanation that a human might like," Trivedi said.
When you make a decision, you can at least try to justify it or explain how you arrived at it. That's a foundational component of society. We are held accountable for the decisions we make. An AI model may not be able to accurately or transparently explain itself. Would you trust it?
Why LLMs struggle with Sudoku
We've seen AI models fail at basic games and puzzles before. OpenAI's ChatGPT (among others) has been totally crushed at chess by the computer opponent in a 1979 Atari game. A recent research paper from Apple found that models can struggle with other puzzles, like the Tower of Hanoi.
It has to do with the way LLMs work and fill in gaps in information. These models try to fill those gaps based on what happens in similar cases in their training data or other things they've seen in the past. With a Sudoku, the question is one of logic. The AI might try to fill each gap in order, based on what seems like a reasonable answer, but to solve the puzzle properly, it has to look at the entire picture and find a logical order that changes from puzzle to puzzle.
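To make the contrast concrete, here is a minimal Python sketch of how a conventional program might solve a 6x6 Sudoku. It is an illustration only, not the puzzle-solving tool the researchers used; the function names and the 2x3 box layout are assumptions made for this example. Instead of filling cells from left to right, it scans the whole grid at every step and works on whichever empty cell has the fewest legal values, so the order of moves differs from puzzle to puzzle.

```python
from typing import List, Set

Grid = List[List[int]]  # 0 marks an empty cell

# Assumed layout for this sketch: a 6x6 grid made of boxes 2 rows tall, 3 columns wide.
SIZE, BOX_ROWS, BOX_COLS = 6, 2, 3


def candidates(grid: Grid, r: int, c: int) -> Set[int]:
    """Values not already used in the cell's row, column, or box."""
    used = set(grid[r]) | {grid[i][c] for i in range(SIZE)}
    br, bc = r - r % BOX_ROWS, c - c % BOX_COLS
    used |= {
        grid[i][j]
        for i in range(br, br + BOX_ROWS)
        for j in range(bc, bc + BOX_COLS)
    }
    return set(range(1, SIZE + 1)) - used


def solve(grid: Grid) -> bool:
    """Fill the grid in place; return True if a full solution was found."""
    best = None
    # Look at the entire grid and pick the most constrained empty cell.
    for r in range(SIZE):
        for c in range(SIZE):
            if grid[r][c] == 0:
                opts = candidates(grid, r, c)
                if best is None or len(opts) < len(best[2]):
                    best = (r, c, opts)
    if best is None:
        return True  # no empty cells left
    r, c, opts = best
    for value in sorted(opts):
        grid[r][c] = value
        if solve(grid):
            return True
        grid[r][c] = 0  # undo the guess and try the next value
    return False
```

The search still backtracks when a guess fails, but every guess is checked against the row, column and box rules before it is written down, which is the kind of whole-board bookkeeping a text predictor does not do by default.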
Chatbots are bad at chess for a similar reason. They find logical next moves but don't necessarily think three, four or five moves ahead. That's the fundamental skill needed to play chess well. Chatbots also sometimes move chess pieces in ways that don't follow the rules or put pieces in needless jeopardy.
You might expect LLMs to be able to solve Sudoku because they're computers and the puzzle consists of numbers, but the puzzles themselves are not really mathematical; they're symbolic. "Sudoku is famous for being a puzzle with numbers that could be done with anything that is not numbers," said Fabio Somenzi, a professor at CU and one of the research paper's authors.
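Somenzi's point can be shown with a short Python sketch. This is an illustration under assumed names, not code from the paper: the checker below never does arithmetic on the cells. It only asks whether every row, column and box contains the same set of distinct symbols, so a grid of letters passes or fails exactly as a grid of digits would.

```python
from typing import Hashable, Sequence


def is_valid_solution(
    grid: Sequence[Sequence[Hashable]],
    box_rows: int = 2,  # assumed 2x3 boxes, the usual layout for a 6x6 grid
    box_cols: int = 3,
) -> bool:
    """Check a filled grid using set membership only, never arithmetic."""
    size = box_rows * box_cols
    if len(grid) != size or any(len(row) != size for row in grid):
        return False
    symbols = set(grid[0])  # whatever symbols the first row uses
    if len(symbols) != size:
        return False
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + box_rows) for c in range(bc, bc + box_cols)}
        for br in range(0, size, box_rows)
        for bc in range(0, size, box_cols)
    ]
    return all(group == symbols for group in rows + cols + boxes)
```

Replacing 1 through 6 with A through F, or with six colors, changes nothing about the check.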
I used a sample prompt from the researchers' paper and gave it to ChatGPT. The tool showed its work, and repeatedly told me it had the answer before showing a puzzle that didn't work, then going back and correcting it. It was like the bot was turning in a presentation that kept getting last-second edits: This is the final answer. No, actually, never mind, this is the final answer. It got the answer eventually, through trial and error. But trial and error isn't a practical way for a person to solve a Sudoku in the newspaper. That's way too much erasing and ruins the fun.
Photo caption: AI and robots can be good at games if they're built to play them, but general-purpose tools like large language models can struggle with logic puzzles. (Ore Huiying/Bloomberg via Getty Images)
AI struggles to show its work
The Colorado researchers didn't just want to see if the bots could solve puzzles. They asked for explanations of how the bots worked through them. Things did not go well.
Testing OpenAI's o1-preview reasoning model, the researchers saw that the explanations -- even for correctly solved puzzles -- didn't accurately explain or justify their moves and got basic terms wrong.
"One thing they're good at is providing explanations that seem reasonable," said Maria Pacheco, an assistant professor of computer science at CU. "They align to humans, so they learn to speak like we like it, but whether they're faithful to what the actual steps need to be to solve the thing is where we're struggling a little bit."
Sometimes, the explanations were completely irrelevant. Since the paper's work was finished, the researchers have continued to test newly released models. Somenzi said that when he and Trivedi were running OpenAI's o4 reasoning model through the same tests, at one point, it seemed to give up entirely.
"The next question that we asked, the answer was the weather forecast for Denver," he said.
(Disclosure: Ziff Davis, CNET's parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
Explaining yourself is an important skill
When you solve a puzzle, you're almost certainly able to walk someone else through your thinking. The fact that these LLMs failed so spectacularly at that basic job isn't a trivial problem. With AI companies constantly talking about "AI agents" that can take actions on your behalf, being able to explain yourself is essential.
Consider the types of jobs being given to AI now, or planned for in the near future: driving, doing taxes, deciding business strategies and translating important documents. Imagine what would happen if you, a person, did one of those things and something went wrong.
"When humans have to put their face in front of their decisions, they better be able to explain what led to that decision," Somenzi said.
It isn't just a matter of getting a reasonable-sounding answer. It needs to be accurate. One day, an AI's explanation of itself might have to hold up in court, but how can its testimony be taken seriously if it's known to lie? You wouldn't trust a person who failed to explain themselves, and you also wouldn't trust someone you found was saying what you wanted to hear instead of the truth.
"Having an explanation is very close to manipulation if it is done for the wrong reason," Trivedi said. "We have to be very careful with respect to the transparency of these explanations."