Thinking AI models collapse in face of complex problems, Apple researchers find

Just days ahead of the much-anticipated Worldwide Developers Conference (WWDC), Apple has released a study titled 'The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity', in which researchers tested 'reasoning' AI models such as Anthropic's Claude, OpenAI's o models, DeepSeek R1 and Google's Gemini Thinking models to see how far they can scale to replicate human reasoning. Spoiler alert: not as far as the AI marketing pitch would have you believe. Could this signal what may be in store for Apple's AI conversation ahead of the keynote?
The study questions the current standard evaluation of Large Reasoning Models (LRMs) using established mathematical and coding benchmarks, arguing that these suffer from data contamination and reveal little about the structure and quality of reasoning traces. Instead, it proposes a controlled experimental testbed using algorithmic puzzle environments. The limitations of AI benchmarking, and the need for it to evolve, are something we have written about earlier.
'We show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,' the research paper points out. These findings are a stark warning to the industry: current LLMs are far from general-purpose reasoners.
The emergence of Large Reasoning Models (LRMs), such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, has been hailed as a significant advancement, potentially marking steps toward more general artificial intelligence. These models characteristically generate responses following detailed 'thinking processes', such as a long Chain-of-Thought sequence, before providing a final answer. While they have shown promising results on various reasoning benchmarks, the ability of benchmarks to judge rapidly evolving models is itself in doubt.
The researchers compare non-thinking LLMs with their 'thinking' evolution. 'At low complexity, non-thinking models are more accurate and token-efficient. As complexity increases, reasoning models outperform but require more tokens—until both collapse beyond a critical threshold, with shorter traces,' they say. The example of Claude 3.7 Sonnet and Claude 3.7 Sonnet Thinking illustrates how both models retain accuracy up to complexity level three, after which the standard LLM sees a significant drop; the thinking model suffers the same drop a couple of levels later. All the while, the thinking model uses significantly more tokens.
The research challenges prevailing evaluation paradigms, which often rely on established mathematical and coding benchmarks that are susceptible to data contamination. Such benchmarks also focus primarily on final-answer accuracy, providing limited insight into the reasoning process itself, which is the key differentiator for a 'thinking' model compared with a simpler large language model. To address these gaps, the study uses four controllable puzzle environments: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. These puzzles allow for precise manipulation of problem complexity while maintaining consistent logical structures and rules that must be explicitly followed. That structure, in theory, opens a window onto how these models attempt to 'think'.
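To make the idea of a 'controllable' puzzle concrete, here is a minimal Python sketch of Tower of Hanoi, where the number of disks acts as the complexity dial. It is an illustration, not the researchers' code: the function names and the (disk, from_peg, to_peg) move format are assumptions made for this example. What is well established is that the shortest solution for n disks takes 2^n - 1 moves, so each extra disk roughly doubles the length of the sequence a model must get exactly right.

```python
def solve_hanoi(n, source=0, target=2, spare=1):
    """Return the shortest move list for n disks.

    Each move is a (disk, from_peg, to_peg) tuple; this format is an
    assumption for illustration, not taken from the study.
    """
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack.
    return (
        solve_hanoi(n - 1, source, spare, target)
        + [(n, source, target)]
        + solve_hanoi(n - 1, spare, target, source)
    )


def min_moves_by_disks(max_disks):
    """Map each disk count to the length of its shortest solution."""
    return {n: len(solve_hanoi(n)) for n in range(1, max_disks + 1)}


if __name__ == "__main__":
    for disks, moves in min_moves_by_disks(10).items():
        print(f"{disks} disks: {moves} moves")
```

The rules never change as disks are added; only the required number of flawless steps grows, which is the property that lets complexity be varied while the logic stays fixed.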
The findings from this controlled experimental setup reveal significant limitations in current frontier LRMs. One of the most striking observations is the complete accuracy collapse that occurs beyond certain complexity thresholds across all tested reasoning models. This is not a gradual degradation but a sharp drop to near-zero accuracy as problems become sufficiently difficult.
'The state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,' note the researchers.
These results inevitably challenge any notion that LRMs truly possess the generalisable problem-solving skills required for planning tasks or multi-step processes. The study also identifies a counter-intuitive scaling limit in the models' reasoning effort, measured by inference token usage during the 'thinking' phase: the models initially spend more tokens as problems get harder, but as complexity nears the point of accuracy collapse, they actually reduce their reasoning effort.
The researchers write that 'despite these claims and performance advancements, the fundamental benefits and limitations of LRMs remain insufficiently understood. Critical questions still persist: Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?' There are further questions about how performance scales with increasing problem complexity, how these models compare with their non-thinking standard LLM counterparts when given the same inference token compute, what the inherent limitations of current reasoning approaches are, and what improvements might be needed to advance toward more robust reasoning.
Where do we go from here?
The researchers make it clear that their test methodology has limitations too. 'While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge intensive reasoning problems,' they say. They add that the use of 'deterministic puzzle simulators assumes that reasoning can be perfectly validated' at every step, a level of validation that may not be feasible in less structured domains. That, they say, limits how far the analysis transfers to more general reasoning.
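What step-by-step validation by a deterministic simulator might look like can be sketched in a few lines of Python for Tower of Hanoi. This is a hypothetical illustration, not the study's simulator: the function name, the (disk, from_peg, to_peg) move format and the return convention are all assumptions. The checker replays a proposed move list against the puzzle rules and reports the index of the first illegal move.

```python
def validate_hanoi(n, moves):
    """Replay (disk, from_peg, to_peg) moves for an n-disk puzzle.

    Returns (True, None) if every move is legal and all disks end on peg 2.
    Otherwise returns (False, i), where i is the index of the first illegal
    move, or len(moves) if the moves are legal but the puzzle is unsolved.
    """
    # Peg 0 starts with disks n..1; the last list element is the top disk.
    pegs = [list(range(n, 0, -1)), [], []]
    for i, (disk, src, dst) in enumerate(moves):
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False, i
        # The named disk must be on top of the source peg.
        if not pegs[src] or pegs[src][-1] != disk:
            return False, i
        # A larger disk may never be placed on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(n, 0, -1))
    return (True, None) if solved else (False, len(moves))


if __name__ == "__main__":
    print(validate_hanoi(2, [(1, 0, 1), (2, 0, 2), (1, 1, 2)]))  # legal and solved
    print(validate_hanoi(2, [(1, 0, 2), (2, 0, 2)]))  # second move is illegal
```

A check this exact is possible only because the puzzle's rules are fully specified, which is precisely the limitation the researchers flag for less structured domains.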
There is little argument that LRMs represent progress, particularly for the practical relevance of AI. Yet this study highlights that not all reasoning models are capable of robust, generalisable reasoning, particularly in the face of increasing complexity. These findings, coming ahead of WWDC 2025 and from Apple's own researchers, may suggest that any AI reasoning announcements will likely be pragmatic. The focus areas could include specific use cases where current AI methodology is reliable (the research paper indicates lower to medium complexity, with less reliance on flawless long-sequence execution) and potentially integrating neural models with traditional computing approaches to handle the complexities where LRMs currently fail. The era of Large Reasoning Models is here, but this 'Illusion of Thinking' study suggests that AI with true reasoning remains a mirage.



Apple WWDC 2025: 'Big' updates coming to iPhones, iPads, MacBooks; what may be the 'hardware announcement'

Time of India

6 hours ago


Apple's annual Worldwide Developers Conference (WWDC) kicks off on June 9, 2025, promising a week of groundbreaking software announcements that will shape the future of its ecosystem. Running through June 13, the event will feature a keynote from Apple CEO Tim Cook at 10 a.m. PT (10.30 pm IST), streamed on Apple's website, Developer app, and YouTube channel, alongside an in-person experience for select developers at Apple Park. Like every year, the event will be big on software.
New naming and a lot of new features likely coming to iOS, iPadOS, macOS and more
With the tagline 'Sleek Peek,' WWDC 2025 will see updates to iOS, iPadOS, macOS, watchOS, tvOS, and visionOS, with a focus on design and AI integration. The headline announcement is expected to be a redesign of Apple's operating systems, potentially rebranded with a year-based naming scheme: iOS 26, iPadOS 26, macOS 26, watchOS 26, tvOS 26, and visionOS 26. Reports suggest iOS 26 will undergo its most significant overhaul since iOS 7, introducing a unified design language with floating tab views, updated iconography, and glass-like UI effects inspired by the Vision Pro's aesthetic. This cohesive look aims to rival Google's Android 16 revamp, prioritizing fluidity and personalization across devices.
Updates on Apple Intelligence
Apple Intelligence, introduced in 2024, will see further enhancements, though not as the centerpiece. Rumors indicate Apple will open its Foundation Models to third-party developers, enabling custom AI-powered features in apps like Safari and Photos, which may be quietly rebranded as 'AI-powered.' A long-awaited Siri overhaul remains in development but is unlikely to debut fully. Additionally, an AI-powered coding tool, possibly in partnership with Anthropic, could be introduced for Xcode, streamlining app development.
iPadOS 26 may introduce a 'Pro' version with advanced multitasking, improved external display support, and professional-grade apps, catering to power users. watchOS 26 is expected to support third-party Control Center widgets, while tvOS 26 could feature a revamped CarPlay UI and animated lock screen artwork. visionOS 26 may bring native support for gaming controllers like PlayStation and Xbox, alongside minor UI tweaks to bolster Apple's spatial computing push.
Likely hardware announcements at WWDC 2025
Hardware announcements are less likely, as Apple's recent M4 MacBook Air release and focus on fall hardware events suggest a software-driven WWDC. However, speculation persists about a potential M4 Ultra chip reveal or Mac Pro updates, though these are considered long shots. Developers will have access to over 100 technical sessions, online labs, and one-on-one consultations with Apple experts, covering topics like Swift, machine learning, and AR/VR tools. The Swift Student Challenge will also spotlight emerging talent, with 50 Distinguished Winners invited to a three-day Apple Park experience. As Apple aims to balance innovation with its privacy-first ethos, WWDC 2025 is set to lay the groundwork for a smarter, more cohesive ecosystem, with iOS 26 leading the charge. Stay tuned for live updates as Apple unveils its vision for the future.

Time of India

6 hours ago


Xiaomi's game-changing chip stuns tech world, even China cheers while the U.S. watches nervously

Xiaomi Xring 01 chip sets new benchmark, challenges Apple and Qualcomm as China applauds breakthrough
Xiaomi Xring 01 chip is making global headlines, not just for its powerful performance but for what it represents: a major leap in China's bid to become a tech powerhouse. This is Xiaomi's first true high-end processor, developed entirely in-house and launched to compete directly with Qualcomm's Snapdragon 8 and Apple's A18 Pro. In a rare move, even the Chinese government publicly praised the chip, underlining its national significance. After four years of intense development, this chip could reshape the balance in the global semiconductor industry.
What is Xiaomi's Xring 01 chip and why is it a big deal?
The Xiaomi Xring 01 is a newly launched System-on-a-Chip (SoC) designed to power flagship devices like the Xiaomi 15S Pro and Xiaomi Pad 7 Ultra. Built on licensed ARM architecture, it uses a unique ten-core configuration split into four clusters:
2 Cortex-X925 cores at 3.9 GHz (for top-level performance)
4 Cortex-A725 cores at 3.4 GHz (for high-load tasks)
2 Cortex-A725 cores at 1.9 GHz (for medium use)
2 Cortex-A520 cores at 1.8 GHz (for efficiency and power saving)
This architecture lets the chip balance raw power with energy efficiency. According to Xiaomi, it's been benchmarked to outperform Qualcomm's Snapdragon 8 Elite and even the Apple A18 Pro, two of the most powerful chips on the market today.
How much did Xiaomi invest in developing the Xring 01 chip?
The Xring 01 is the result of four years of R&D, involving over 2,500 engineers and a massive investment of 13.5 billion yuan (around €1.67 billion). This is not a one-time push. Xiaomi's CEO, Lei Jun, confirmed that the company plans to pour another €6 billion into semiconductor development over the next ten years.
This chip also marks the birth of a new processor line for Xiaomi, as the '01' naming signals the beginning of an entire Xring family. The company isn't just trying to make devices faster; it's aiming to become a serious player in the global chip industry.
Can Xiaomi's chip really compete with Apple and Qualcomm?
Initial tests say yes, and that's a big deal. Xiaomi claims the Xring 01 matches or beats Qualcomm's top-tier Snapdragon 8 Elite and performs better than Apple's A18 Pro. These are early benchmarks, so real-world performance might vary, but the results have already turned heads.
This isn't Xiaomi's first time building chips. Back in 2017, they introduced the Surge S1, but it never made a serious impact. The Xring 01, by contrast, appears to be in a completely different class, positioning Xiaomi alongside the world's most advanced tech companies.
Why is the Chinese government praising the Xring 01 chip?
China has long wanted to reduce its dependence on foreign tech, especially American chips. That's why the launch of a powerful homegrown chip like the Xring 01 has drawn praise from Chinese officials, who see it as a national achievement in the global tech race.
With rising tensions between China and the United States, chip technology has become a strategic battleground. After the U.S. blocked Huawei from accessing chip-making giant TSMC, Chinese firms have rushed to find local alternatives. Huawei now uses SMIC, a Chinese chip foundry, but it still can't compete at the sub-7nm level. Xiaomi, for now, still uses TSMC to manufacture the Xring 01. But with global pressure mounting, the company is reportedly working on a 'Plan B', anticipating the kind of trade sanctions that hit Huawei.
What does the future hold for Xiaomi and China's chip industry?
Xiaomi's breakthrough is more than just a technical win; it's a geopolitical milestone. With the Xring 01, the company has proven it can design chips that rival or surpass the global best. What's left is manufacturing autonomy. If China can close that final gap, it could challenge the U.S. and Taiwan's dominance in semiconductors.
For now, Xiaomi has joined the frontlines of this tech revolution. As more Chinese firms follow, the global chip landscape may never look the same again.
FAQs:
Q1. What is the Xiaomi Xring 01 chip? It's Xiaomi's powerful new processor that competes with Apple's A18 Pro and Snapdragon 8 Elite.
Q2. Which devices use the Xiaomi Xring 01 chip? It powers the Xiaomi 15S Pro and Xiaomi Pad 7 Ultra.
