AI's Achilles Heel—Puzzles Humans Solve in Seconds Often Defy Machines

There are many ways to test the intelligence of an artificial intelligence: conversational fluidity, reading comprehension or mind-bendingly difficult physics. But some of the tests most likely to stump AIs are ones that humans find relatively easy, even entertaining. Though AIs increasingly excel at tasks that require high levels of human expertise, this does not mean they are close to attaining artificial general intelligence, or AGI. AGI requires that an AI take a very small amount of information and use it to generalize and adapt to highly novel situations. This ability, which is the basis for human learning, remains challenging for AIs.
One test designed to evaluate an AI's ability to generalize is the Abstraction and Reasoning Corpus, or ARC: a collection of tiny, colored-grid puzzles that ask a solver to deduce a hidden rule and then apply it to a new grid. Developed by AI researcher François Chollet in 2019, it became the basis of the ARC Prize Foundation, a nonprofit that administers the test, now an industry benchmark against which all major AI models are evaluated. The organization also develops new tests and routinely uses two of them (ARC-AGI-1 and its more challenging successor, ARC-AGI-2). This week the foundation is launching ARC-AGI-3, which is specifically designed to test AI agents by making them play video games.
Scientific American spoke to ARC Prize Foundation president, AI researcher and entrepreneur Greg Kamradt to understand how these tests evaluate AIs, what they tell us about the potential for AGI and why they are often challenging for deep-learning models even though many humans tend to find them relatively easy. Links to try the tests are at the end of the article.
[An edited transcript of the interview follows.]
What definition of intelligence is measured by ARC-AGI-1?
Our definition of intelligence is your ability to learn new things. We already know that AI can win at chess. We know they can beat Go. But those models cannot generalize to new domains; they can't go and learn English. So what François Chollet made was a benchmark called ARC-AGI—it teaches you a mini skill in the question, and then it asks you to demonstrate that mini skill. We're basically teaching something and asking you to repeat the skill that you just learned. So the test measures a model's ability to learn within a narrow domain. But our claim is that it does not measure AGI because it's still in a scoped domain [in which learning applies to only a limited area]. It measures that an AI can generalize, but we do not claim this is AGI.
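For readers who have not seen one of these tasks, below is a minimal sketch of the teach-then-demonstrate format Kamradt describes, assuming the publicly documented ARC JSON layout of paired "train" and "test" grids; the particular grids and the recoloring rule are invented for illustration only.

```python
# A minimal ARC-style task, assuming the public JSON layout of "train" and
# "test" pairs. The grids and the rule are made up for illustration: the
# training pairs "teach" the mini skill, and the solver must apply it to the
# test input.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[2, 0], [0, 0]]},
        {"input": [[0, 1], [0, 1]], "output": [[0, 2], [0, 2]]},
    ],
    "test": [
        {"input": [[1, 1], [0, 0]]},  # solver must produce the matching output grid
    ],
}

def apply_rule(grid):
    """Hypothetical deduced rule: recolor every 1 to 2, leave other cells alone."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

print(apply_rule(task["test"][0]["input"]))  # [[2, 2], [0, 0]]
```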
How are you defining AGI here?
There are two ways I look at it. The first is more tech-forward, which is 'Can an artificial system match the learning efficiency of a human?' Now what I mean by that is after humans are born, they learn a lot outside their training data. In fact, they don't really have training data, other than a few evolutionary priors. So we learn how to speak English, we learn how to drive a car, and we learn how to ride a bike—all these things outside our training data. That's called generalization. When you can do things outside of what you've been trained on, we define that as intelligence. Now, an alternative definition of AGI that we use is when we can no longer come up with problems that humans can do and AI cannot—that's when we have AGI. That's an observational definition. The flip side is also true, which is as long as the ARC Prize or humanity in general can still find problems that humans can do but AI cannot, then we do not have AGI. One of the key factors about François Chollet's benchmark... is that we test humans on it, and the average human can do these tasks and these problems, but AI still has a really hard time with them. The reason that's so interesting is that some advanced AIs, such as Grok, can pass any graduate-level exam or do all these crazy things, but that's spiky intelligence. It still doesn't have the generalization power of a human. And that's what this benchmark shows.
How do your benchmarks differ from those used by other organizations?
One of the things that differentiates us is that we require our benchmark to be solvable by humans. That's in opposition to other benchmarks, which use 'Ph.D.-plus-plus' problems. I don't need to be told that AI is smarter than me—I already know that OpenAI's o3 can do a lot of things better than me, but it doesn't have a human's power to generalize. That's what we measure on, so we need to test humans. We actually tested 400 people on ARC-AGI-2. We got them in a room, we gave them computers, we did demographic screening, and then gave them the test. The average person scored 66 percent on ARC-AGI-2. Collectively, though, the aggregated responses of five to 10 people will contain the correct answers to all the questions on ARC-AGI-2.
What makes this test hard for AI and relatively easy for humans?
There are two things. Humans are incredibly sample-efficient with their learning, meaning they can look at a problem and with maybe one or two examples, they can pick up the mini skill or transformation and they can go and do it. The algorithm that's running in a human's head is orders of magnitude better and more efficient than what we're seeing with AI right now.
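As a hedged illustration of what picking up a transformation from one or two examples can mean computationally, the toy sketch below searches a tiny, hand-picked set of candidate transformations for one that reproduces every training pair and then applies it to a new grid. This is not how the benchmark itself works or how humans solve it; real ARC solvers search far richer program spaces. All names and grids here are invented for the example.

```python
# Toy program search over a handful of candidate transformations: keep whichever
# candidate reproduces every training pair, then apply it to a new grid. This
# only sketches the idea of generalizing from one or two examples.
def rotate90(g):
    return [list(row) for row in zip(*g[::-1])]

def flip_h(g):
    return [row[::-1] for row in g]

def recolor(g, a=1, b=2):
    return [[b if c == a else c for c in row] for row in g]

CANDIDATES = {"rotate90": rotate90, "flip_h": flip_h, "recolor": recolor}

def induce(train_pairs):
    """Return the first candidate consistent with all training examples, if any."""
    for name, fn in CANDIDATES.items():
        if all(fn(p["input"]) == p["output"] for p in train_pairs):
            return name, fn
    return None, None

train = [
    {"input": [[1, 0]], "output": [[0, 1]]},
    {"input": [[0, 0], [1, 0]], "output": [[0, 0], [0, 1]]},
]
name, rule = induce(train)                 # both examples are explained by flip_h
print(name, rule([[1, 1], [1, 0]]) if rule else None)   # flip_h [[1, 1], [0, 1]]
```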
What is the difference between ARC-AGI-1 and ARC-AGI-2?
So ARC-AGI-1, François Chollet made that himself. It was about 1,000 tasks. That was in 2019. He basically did the minimum viable version in order to measure generalization, and it held for five years because deep learning couldn't touch it at all. It wasn't even getting close. Then the reasoning models that OpenAI released in 2024 started making progress on it, which showed a step-level change in what AI could do. Then, when we went to ARC-AGI-2, we went a little bit further down the rabbit hole in regard to what humans can do and AI cannot. It requires a little bit more planning for each task. So instead of getting solved within five seconds, humans may be able to do it in a minute or two. There are more complicated rules, and the grids are larger, so you have to be more precise with your answer, but it's the same concept, more or less.... We are now launching a developer preview for ARC-AGI-3, and that's a complete departure from this format. The new format will actually be interactive. So think of it more as an agent benchmark.
How will ARC-AGI-3 test agents differently compared with previous tests?
If you think about everyday life, it's rare that we have a stateless decision. When I say stateless, I mean just a question and an answer. Right now all benchmarks are more or less stateless benchmarks. If you ask a language model a question, it gives you a single answer. There's a lot that you cannot test with a stateless benchmark. You cannot test planning. You cannot test exploration. You cannot test intuiting about your environment or the goals that come with that. So we're making 100 novel video games that we will use to test humans, to make sure that humans can do them, because that's the basis for our benchmark. And then we're going to drop AIs into these video games and see if they can understand an environment that they've never seen before. To date, with our internal testing, we haven't had a single AI be able to beat even one level of one of the games.
Can you describe the video games here?
Each 'environment,' or video game, is a two-dimensional, pixel-based puzzle. These games are structured as distinct levels, each designed to teach a specific mini skill to the player (human or AI). To successfully complete a level, the player must demonstrate mastery of that skill by executing planned sequences of actions.
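The actual ARC-AGI-3 interface has not been published as of this interview, but the sketch below illustrates the kind of stateful, turn-by-turn loop the format implies: an agent repeatedly observes a grid, chooses an action and must infer the goal from how the environment responds. Everything here (the game, its actions and the random baseline agent) is a hypothetical stand-in, not the real ARC-AGI-3 API.

```python
import random

# A toy stand-in for the interactive evaluation described above. The game, its
# action set and the random "agent" are invented for illustration; the point is
# only that the agent acts over many turns and learns from feedback, rather
# than answering a single stateless question.
class ToyGridGame:
    """Move a marker on a grid until it reaches a hidden target cell."""
    def __init__(self):
        self.pos, self.target = [0, 0], [3, 2]

    def reset(self):
        self.pos = [0, 0]
        return tuple(self.pos)                # the agent observes only its position

    def step(self, action):                   # actions: "up", "down", "left", "right"
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
        self.pos = [self.pos[0] + dx, self.pos[1] + dy]
        return tuple(self.pos), self.pos == self.target   # (observation, done)

def play(policy, game, max_turns=100):
    obs = game.reset()
    for turn in range(max_turns):
        obs, done = game.step(policy(obs))    # act, observe, repeat
        if done:
            return turn + 1                   # solved in this many actions
    return None                               # never discovered the goal

# A random-explorer baseline; per the interview, no AI has yet beaten even one
# level of the real games in internal testing.
print(play(lambda obs: random.choice(["up", "down", "left", "right"]), ToyGridGame()))
```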
How is using video games to test for AGI different from the ways that video games have previously been used to test AI systems?
Video games have long been used as benchmarks in AI research, with Atari games being a popular example. But traditional video game benchmarks face several limitations. Popular games have extensive training data publicly available, lack standardized performance evaluation metrics and permit brute-force methods involving billions of simulations. Additionally, the developers building AI agents typically have prior knowledge of these games—unintentionally embedding their own insights into the solutions.