Forbes


Despite Billions In Investment, AI Reasoning Models Are Falling Short

In early June, Apple released an explosive paper, The Illusion of Thinking: Understanding the Limitations of Reasoning Models via the Lens of Problem Complexity. It examines the reasoning ability of Large Reasoning Models (LRMs) such as Claude 3.7 Sonnet Thinking, Gemini Thinking, DeepSeek-R1, and OpenAI's o-series models — how they think, especially as problem complexity increases. The research community dug in, and the responses were swift. Despite the increasing adoption of generative AI and the presumption that AI will replace tasks and jobs at scale, these Large Reasoning Models are falling short.

By definition, Large Reasoning Models (LRMs) are Large Language Models (LLMs) focused on step-by-step thinking. This is called Chain of Thought (CoT), which facilitates problem solving by guiding the model to articulate its reasoning steps. Jing Hu, writer, researcher and author of 2nd Order Thinkers, who dissected the paper's findings, remarked that 'AI is just sophisticated pattern matching, no thinking, no reasoning' and 'AI can only do tasks accurately up to a certain degree of complexity.'

As part of the study, the researchers created closed puzzle environments for games like Checker Jumping, River Crossing, and Tower of Hanoi, which let them vary problem complexity under controlled conditions. Each puzzle was run across three stages of complexity, from the simplest to the most complex. Across the three stages, the paper concluded:

At 'Low Complexity', the regular models performed better than LRMs. Hu explained, 'The reasoning models were overthinking — wrote thousands of words, exploring paths that weren't needed, second-guessing correct answers and making things more complicated than they should have been.' In the Tower of Hanoi, a human can solve the puzzle within seven moves, while Claude 3.7 Sonnet Thinking uses '10x more tokens compared to the regular version while achieving the same accuracy... it's like driving a rocket ship to the corner store.'

At 'Medium Complexity', LRMs outperformed LLMs, revealing traces of chain-of-thought reasoning. Hu notes that LRMs tended to explore wrong answers first before eventually finding the correct one; however, she argues, 'these thinking models use 10-50x more compute power (15,000-20,000 tokens vs. 1,000-5,000). Imagine paying $500 instead of $50 for a hamburger that tastes 10% better.' For Hu, this is not an impressive breakthrough; it reveals a level of complexity dressed to impress audiences yet 'simple enough to avoid total failure.'

At 'High Complexity', both LRMs and standard models collapse, and accuracy drops to zero. As the problems get more complex, the models simply stop trying. Referencing Figure 6 from Apple's paper, Hu explains, 'Accuracy starts high for all models on simple tasks, dips slowly then crashes to near zero at a "critical point" of complexity. If this is compared to the row displaying token use, the latter rises as problems become harder ("models think more"), peaks, then drops sharply at the same critical point even if token budget is still available.' Hu explains that the models aren't scaling up their effort; rather, they abandon real reasoning and output less.
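To make the complexity scaling concrete, consider the Tower of Hanoi puzzle the researchers used. The optimal strategy is easy to state, but the number of moves it demands grows exponentially with the number of disks: a three-disk puzzle takes exactly the seven moves mentioned above, while twenty disks require over a million. The minimal Python sketch below is an ordinary textbook solution, not code from the Apple paper; it generates the optimal move list and prints the counts.

```python
def hanoi(n, source, target, spare, moves=None):
    """Classic recursive Tower of Hanoi: returns the optimal move list."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((source, target))
        return moves
    hanoi(n - 1, source, spare, target, moves)   # park n-1 disks on the spare peg
    moves.append((source, target))               # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top
    return moves

for disks in (3, 10, 20):
    count = len(hanoi(disks, "A", "C", "B"))
    print(disks, count)   # 3 -> 7 moves; 10 -> 1,023; 20 -> 1,048,575 (always 2^n - 1)
```

The point is that on harder instances the full, correct answer itself becomes very long, which foreshadows the output-length limitation Marcus raises below.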
Gary Marcus is an authority on AI, a scientist who has written several books including The Algebraic Mind and Rebooting AI, and he continues to scrutinize the releases from these AI companies. In his response to Apple's paper, he states, 'it echoes and amplifies the training distribution argument that I have been making since 1998: neural networks of various kinds can generalize within a training distribution of data they are exposed to, but their generalizations tend to break down outside that distribution.' This means that the more edge cases these LRMs encounter, the more they will go off-track, especially on problems that are very different from their training data. He also notes that LRMs have a scaling problem because 'the outputs would require too many output tokens', meaning the correct answer would simply be too long for the LRMs to produce. The implications? Hu advises, 'This comparison matters because it debunks hype around LRMs by showing they only shine on medium complexity tasks, not simple or extreme ones.'

Why this Hedge Fund CEO Passes on GenAI

Ryan Pannell is the CEO of Kaiju Worldwide, a technology research and investment firm specializing in predictive artificial intelligence and algorithmic trading. He operates in an industry that demands compliance and a stronger level of certainty. He uses predictive AI, a type of artificial intelligence that leverages statistical analysis and machine learning to forecast from patterns in historical data; unlike generative AI such as LLM and LRM chatbots, it does not create original content.

Sound data is paramount, and the hedge fund leverages only closed datasets. As Pannell explains, 'In our work with price, time, and quantity, the analysis isn't influenced by external factors — the integrity of the data is reliable, as long as proper precautions are taken, such as purchasing quality data sets and putting them through rigorous quality control processes, ensuring only fully sanitized data are used.' The data they purchase — price, time, and quantity — come from three different vendors, and when they compare the outputs, 99.999% of the time they all match. When there is an error, since a vendor occasionally provides incorrect price, time, or quantity information, the other two usually expose the mistake. Pannell argues, 'This is why we use data from three sources. Predictive systems don't hallucinate because they aren't guessing.'

For Kaiju, the predictive model uses only what it knows, plus whatever new data they collect, to spot patterns and predict what will come next. 'In our case, we use it to classify market regimes — bull, bear, neutral, or unknown. We've fed them trillions of transactions and over four terabytes of historical price and quantity data. So, when one of them outputs "I don't know," it means it's encountered something genuinely unprecedented.' He claims that if the system sees loose patterns and predicts a bear market with 75% certainty, it is likely correct; 'I don't know,' by contrast, signals a unique scenario, something never seen in decades of market data. 'That's rare, but when it happens, it's the most fascinating for us,' says Pannell.

In 2017, when Trump policy changes caused major trade disruptions, these systems were not yet in place, and Pannell says the gains they made during that period of high uncertainty were mostly luck. The system today, which has experienced that level of volatility before, can perform well, and with consistency.
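Kaiju's systems are proprietary, but the two ideas Pannell describes can be sketched in a few lines: reconciling the same price, time and quantity record across three vendors by majority agreement, and returning 'unknown' rather than a guess when no regime pattern is matched with enough confidence. Everything in the sketch below, including the function names and the 0.70 threshold, is a hypothetical illustration, not Kaiju's implementation.

```python
from collections import Counter

def reconcile(records):
    """Majority vote across three vendor copies of one (price, time, quantity) record.
    If two of the three agree, the outlier is treated as a vendor error; if all three
    disagree, the record is dropped rather than guessed at."""
    value, votes = Counter(records).most_common(1)[0]
    return value if votes >= 2 else None

def classify_regime(scores, threshold=0.70):
    """Return the highest-scoring market regime, or 'unknown' when no pattern is
    matched with enough confidence -- the system declines to guess."""
    regime, confidence = max(scores.items(), key=lambda kv: kv[1])
    return regime if confidence >= threshold else "unknown"

# Vendor B reports a bad price for the same trade; the other two outvote it.
print(reconcile([(101.25, "09:30:00", 500),
                 (999.99, "09:30:00", 500),
                 (101.25, "09:30:00", 500)]))      # -> (101.25, '09:30:00', 500)

# A loose bearish pattern clears the bar; a never-seen pattern does not.
print(classify_regime({"bull": 0.10, "bear": 0.75, "neutral": 0.15}))   # -> 'bear'
print(classify_regime({"bull": 0.35, "bear": 0.33, "neutral": 0.32}))   # -> 'unknown'
```

The design choice mirrors Pannell's point: the system either reports a pattern it has actually matched, or it says it doesn't know.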
AI Detection and the Anomaly of COVID-19

Just before the dramatic stock market drop of February 2020, the market was still at an all-time high. However, Pannell noted that the system was signaling that something was very wrong, and the strange behavior in the market kept intensifying. 'The system estimated a 96% chance of a major drop and none of us knew exactly why at the time. That's the challenge with explainability — AI can't tell you about news events, like a cruise ship full of sick people or how COVID spread across the world. It simply analyzes price, time and quantity patterns and predicts a fall based on changing behavior it is seeing, even though it has no awareness of the underlying reasons. We, on the other hand, were following the news as humans do.' The news pointed to this 'COVID-19' thing, which at the time seemed isolated. Pannell's team weren't sure what to expect, but in hindsight he recognized the value of the system: it analyzes terabytes of data through billions of examinations daily, looking for any recognizable pattern, and sometimes determines that what is happening matches nothing it has seen before. In those cases, he realized, the system acted as an early warning, allowing them to increase their hedges.

For all the billions of dollars these predictive AI systems generate, their efficacy drops off after about a week to roughly 17%-21%, and making trades beyond that horizon is extremely risky. Pannell says he hasn't seen any evidence that AI — of any kind — will be able to predict financial markets with accuracy 90 days, six months or a year in advance. 'There are simply too many unpredictable factors involved. Predictive AI is highly accurate in the immediate future — between today and tomorrow — because the scope of possible changes is limited.'

Pannell remains skeptical about the promises of LLMs and current LRMs for his business. He describes wasting three hours being lied to by ChatGPT-4o when he was experimenting with using it to architect a new framework. At first he was blown away that the system had substantially increased its functionality, but after three hours he determined it had been lying to him the entire time. He explains that when he asked, 'Do you have the capability to do what you just said?', the system responded that it did not, adding that its latest update had programmed it to keep him engaged over giving an honest answer. Pannell adds, 'Within a session, an LLM can adjust when I give it feedback, like "don't do this again," but as soon as the session goes for too long, it forgets and starts lying again.' He also points to ChatGPT's memory constraints: it performs well for the first hour, but by the second or third hour it starts forgetting earlier context, making mistakes and dispensing false information. He described it to a colleague this way: 'It's like working with an extremely talented but completely drunk programmer. It does some impressive work, but it also over-estimates its capabilities, lies about what it can and can't do, delivers some well-written code, wrecks a bunch of stuff, apologizes and says it won't do it again, tells me that my ideas are brilliant and that I am "right for holding it accountable", and then repeats the whole process over and over again. The experience can be chaotic.'

Could Symbolic AI be the Answer?

Catriona Kennedy holds a Ph.D. in Computer Science from the University of Birmingham and is an independent researcher focusing on cognitive systems and ethical automation.
Kennedy explains that automated reasoning has always been a branch of AI, with the inference engine at its core: it applies the rules of logic to a set of statements encoded in a formal language. 'An inference engine is like a calculator, but unlike AI, it operates on symbols and statements instead of numbers. It is designed to be correct.' It is designed to deduce new information, simulating the decision-making of a human expert. Generative AI, in comparison, is a statistical generator and therefore prone to hallucinations, because such models 'do not interpret the logic of the text in the prompt.'

This is the heart of symbolic AI, a system distinct from generative AI, one that uses an inference engine and allows for human experience and authorship. What distinguishes symbolic AI is its knowledge structure. She explains, 'You have your data and connect it with knowledge allowing you to classify the data based on what you know. Metadata is an example of knowledge. It describes what data exists and what it means and this acts as a knowledge base linking data to its context — such as how it was obtained and what it represents.'

Kennedy adds that ontologies are becoming popular again. An ontology defines all the things that exist in a domain and their interdependent properties and relationships. As an example, animal is a class, bird is a subclass, and eagle or robin are further subclasses. The properties of a bird: it has two feet, has feathers, and flies. However, what an eagle eats may differ from what a robin eats. Ontologies and metadata can connect with logic-based rules to ensure correct reasoning based on defined relationships.
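Kennedy's bird example maps directly onto code. The sketch below builds a toy ontology (animal, bird, eagle, robin), derives a subclass's full set of properties by walking up the hierarchy, and answers a simple subsumption question by deduction rather than statistics. It is a deliberately minimal illustration; real ontologies and inference engines (OWL reasoners, Prolog, Datalog and the like) are far richer.

```python
# Toy ontology: each class lists its parent and its own properties.
ontology = {
    "animal": {"parent": None,     "properties": {}},
    "bird":   {"parent": "animal", "properties": {"feet": 2, "has_feathers": True, "flies": True}},
    "eagle":  {"parent": "bird",   "properties": {"diet": "small mammals and fish"}},
    "robin":  {"parent": "bird",   "properties": {"diet": "insects and berries"}},
}

def properties_of(cls):
    """Inference by inheritance: walk up the hierarchy and merge properties,
    letting the more specific class add to (or override) the general one."""
    merged = {}
    while cls is not None:
        for key, value in ontology[cls]["properties"].items():
            merged.setdefault(key, value)   # keep the most specific value already seen
        cls = ontology[cls]["parent"]
    return merged

def is_a(cls, ancestor):
    """Subsumption check: is cls a subclass of ancestor?"""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ontology[cls]["parent"]
    return False

print(properties_of("eagle"))          # inherits two feet, feathers, flight; adds its own diet
print(properties_of("robin")["diet"])  # 'insects and berries', unlike the eagle
print(is_a("robin", "animal"))         # True, deduced from the hierarchy rather than guessed
```

Because every answer is derived from explicitly stated facts and rules, the engine cannot hallucinate a property it was never given; the trade-off, as Kennedy notes below, is that the knowledge structure has to be built and maintained by hand.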
The main limitation of pure symbolic AI is that it doesn't easily scale. Kennedy points out that these knowledge structures can become unwieldy: while symbolic AI excels at special-purpose tasks, it becomes brittle at very complex levels and difficult to manage when dealing with large, noisy or unpredictable data sets.

What we have today in current LRMs has not satisfied these researchers that AI models are any closer to thinking like humans. As Marcus puts it, 'our argument is not that humans don't have any limits, but LRMs do, and that's why they aren't intelligent... based on what we observe from their thoughts, their process is not logical and intelligent.' Jing Hu concludes, 'Too much money depends on the illusion of progress — there is a huge financial incentive to keep the hype going even if the underlying technology isn't living up to the promises. Stop the blind worship of GenAI.' (Note: OpenAI recently raised $40 billion at a post-money valuation of $300 billion.)

For hedge fund CEO Ryan Pannell, combining generative AI (which can handle communication and language) with predictive systems (which can accurately process data in closed environments) would be ideal. As he explains, 'The challenge is that predictive AI usually doesn't have a user-friendly interface; it communicates in code and math, not plain English. Most people can't access or use these tools directly.' He opts for integrating GPT as an intermediary, 'where you ask GPT for information, and it relays that request to a predictive system and then shares the results in natural language — it becomes much more useful. In this role, GPT acts as an effective interlocutor between the user and the predictive model.'

Gary Marcus believes that combining symbolic AI with neural networks — an approach known as neurosymbolic AI — connecting data to knowledge in ways that leverage human thought processes, will produce better results: a robust AI capable of 'reasoning, learning and cognitive modelling.' Marcus laments that for four decades the elites who have driven machine learning, 'closed-minded egotists with too much money and power', have 'tried to keep a good idea, namely neurosymbolic AI, down — only to accidentally vindicate the idea in the end.' 'Huge vindication for what I have been saying all along: we need AI that integrates both neural networks and symbolic algorithms and representations (such as logic, code, knowledge graphs, etc.). But also, we need to do so reliably, and in a general way, and we haven't yet crossed that threshold.'
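As a closing illustration of the hybrid direction both Pannell and Marcus point to, here is a minimal hypothetical sketch of the interlocutor pattern Pannell describes: a generative layer that only rephrases structured output from a non-generative predictive backend. The functions, values and wording are placeholders invented for illustration; no real trading system or LLM API is shown.

```python
def forecast_regime(horizon_days):
    """Placeholder for a predictive model that 'communicates in code and math':
    it returns a structured forecast, not natural language."""
    return {"regime": "bear", "confidence": 0.75, "horizon_days": horizon_days}

def llm_phrase(result):
    """Placeholder for the generative layer. In a real system this would be an
    LLM asked only to restate the structured result in plain English."""
    return (f"Over the next {result['horizon_days']} day(s), the model classifies the "
            f"market as '{result['regime']}' with {result['confidence']:.0%} confidence.")

def answer(user_question):
    """The interlocutor pattern: the language layer never invents numbers; it relays
    the request to the predictive system and rephrases its structured output."""
    result = forecast_regime(horizon_days=1)   # the closed, non-generative backend
    return llm_phrase(result)

print(answer("What does tomorrow look like?"))
```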
