
Why AI Keeps Failing Its Own Hardest Test (And What That Reveals About Intelligence)

AI fails intelligence tests like ARC-AGI because current systems excel at retrieving and recombining patterns from training data, not at reasoning through genuinely novel problems from scratch. When a task has no overlap with anything the model was trained on, performance collapses to near zero, regardless of how well that same model performs on law exams or coding benchmarks.

That gap between what AI can do and what it cannot do is not a minor technical detail. It is the central unresolved question in artificial intelligence research, and a benchmark called ARC-AGI has spent five years making that gap impossible to ignore. The best AI systems in the world, models that routinely outperform human experts on standardized tests, still score in the low single digits on the hardest version of this test. Humans score around 60 percent on average, without any special preparation.

Understanding why this happens requires looking at what “intelligence” actually means, how researchers have tried to measure it, and why the results keep revealing the same uncomfortable truth about current AI systems. Here is exactly what the data shows.

What the ARC-AGI Test Actually Measures (And Why It Is Different)

ARC-AGI stands for Abstraction and Reasoning Corpus for Artificial General Intelligence. Francois Chollet, the creator of the Keras deep learning library and at the time a researcher at Google, designed it in 2019 specifically to measure what he calls fluid intelligence: the ability to reason through brand-new problems using minimal examples, with no prior training on similar tasks.

Each task in the benchmark is a visual grid puzzle. The system receives two or three input-output example pairs showing a transformation, then must apply the same rule to a new input it has never seen. The transformation rule is different for every single puzzle. There is no overlap with training data, by design. Every task is a first encounter.
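That task structure can be sketched in a few lines of code. This is an illustrative toy, not the official dataset schema: a task bundles a few input-output grid pairs with a test input, and a solver must infer the hidden rule from the pairs alone. The dictionary layout and the `apply_rule` helper below are assumptions made for the example.

```python
# Toy sketch of an ARC-style task. Grids are small 2D arrays of color
# indices. The hidden rule for this made-up task is "recolor 1 to 2";
# a real task's rule could be any transformation at all.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": {"input": [[1, 0], [0, 1]]},
}

def apply_rule(grid, mapping):
    """Apply a cell-wise color mapping -- just one of many possible rule types."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

# A solver has only the train pairs from which to infer the mapping.
inferred = {1: 2}
prediction = apply_rule(task["test"]["input"], inferred)
print(prediction)  # [[2, 0], [0, 2]]
```

The difficulty is not applying a known rule, as above, but inferring an unknown one from two or three examples when the space of possible rules is essentially unbounded.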

This is the exact opposite of how AI models are typically evaluated. Most benchmarks, including MMLU (Massive Multitask Language Understanding), test whether a model has absorbed and retained factual knowledge across dozens of domains. They measure what Chollet calls crystallized intelligence: stored knowledge that was accumulated over time. ARC-AGI measures something else entirely. It measures whether a system can figure out something it was never shown, using only the reasoning capacity it has at the moment of the test.

Chollet’s formal definition, published in his 2019 paper on measuring intelligence, frames the concept through information theory: intelligence is skill-acquisition efficiency, meaning how much new capability a system can generate per unit of prior knowledge and experience. By this standard, a system that requires billions of training examples to perform well on a narrow task is demonstrating low intelligence, even if the task outputs look impressive to a casual observer.
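The intuition behind that definition can be shown with a deliberately simplified ratio. This is not Chollet's formal algorithmic-information-theoretic measure, just an illustration of the "capability gained per unit of experience" framing; the function name and all numbers are invented for the example.

```python
# Toy illustration of intelligence as skill-acquisition efficiency:
# how much new capability per example of experience consumed.
def skill_acquisition_efficiency(skill_gained, examples_consumed):
    """Higher is better: more new capability from less prior experience."""
    return skill_gained / examples_consumed

# A human inferring an ARC rule from 3 examples versus a system that
# needed a million in-domain samples to reach the same accuracy.
human = skill_acquisition_efficiency(0.9, 3)
model = skill_acquisition_efficiency(0.9, 1_000_000)
print(human > model)  # True: same skill, vastly less experience required
```

By this yardstick, equal task performance does not imply equal intelligence; the denominator matters as much as the numerator.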

The benchmark was designed around one explicit principle: tasks must be easy for any adult human, with no special expertise required, and hard for AI. That asymmetry is the measurement. The size of the gap between human and AI performance tells you how far AI is from demonstrating genuine fluid intelligence.

How Current AI Models Score (The Specific Numbers)

The scoring trajectory on ARC-AGI version 1 took four years to move meaningfully. GPT-3 scored approximately 0 percent in 2020, the year after the benchmark was introduced. GPT-4o, released in 2024 after four years of rapid AI progress, scored approximately 5 percent. The models that were rewriting competitive coding, acing medical licensing exams, and generating publishable scientific text were essentially failing a visual reasoning test that any ten-year-old could partially solve.

Then came OpenAI's o3 model in December 2024, and the numbers changed dramatically. On the semi-private ARC-AGI-1 evaluation set of 100 tasks, o3 scored 75.7 percent in high-efficiency mode and 87.5 percent in low-efficiency mode. On the public evaluation set, low-efficiency o3 reached 91.5 percent. Human performance on the same benchmark sits above 85 percent for individuals and above 95 percent for a review panel. For the first time, a single AI model was approaching human-range performance on ARC-AGI-1.

The catch is in the compute cost. High-efficiency o3 cost approximately $26 per task, or $2,680 to run the 100-task evaluation. Low-efficiency o3 cost $4,560 per task, meaning a full run cost roughly $456,000. A human solving the same tasks costs around $5 per task. Even setting aside the question of what the score means conceptually, the economic gap is enormous.
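The scale of that economic gap falls straight out of the per-task figures quoted above; the short calculation below just makes the arithmetic explicit.

```python
# Cost comparison using the per-task estimates quoted in the text.
tasks = 100
low_eff_per_task = 4_560.0  # low-efficiency o3, dollars per task
human_per_task = 5.0        # rough human cost per task

full_run = low_eff_per_task * tasks
ratio = low_eff_per_task / human_per_task
print(full_run)  # 456000.0 -- roughly $456,000 for the 100-task evaluation
print(ratio)     # 912.0 -- o3 costs about 900x a human, per task
```

A three-orders-of-magnitude cost difference per task is why researchers treat efficiency, not just the score, as part of the measurement.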

The ARC Prize Foundation responded to o3's performance by releasing ARC-AGI-2, a substantially harder version of the benchmark, in early 2025 with a $725,000 prize pool. Every task in ARC-AGI-2 was verified as solvable by at least two humans in under two attempts. Current AI scores on this version: pure language models score 0 percent. The o3-preview model in low-efficiency mode scores 4 percent. DeepSeek's r1 model scores 0.3 percent. GPT-4.5 scores 0 percent. Humans average 60 percent and reach 100 percent as a panel. The gap reopened wider than ever.

Why Passing Language Tasks Does Not Equal Intelligence

The confusion about AI intelligence comes from a category error that is easy to make and hard to shake. When a model passes the bar exam in the top 10 percent of test-takers, it looks like it understands law. When a model scores above the human expert average on the 57-subject MMLU benchmark, covering medicine, history, mathematics, and ethics, it looks like it understands a great deal. When Claude, Gemini Ultra, and GPT-4 all cross the 86 percent threshold on MMLU, a benchmark where the human expert baseline is about 89.8 percent, the inference seems reasonable: these systems are approaching expert-level intelligence.

But Chollet’s central argument, backed now by five years of ARC-AGI data, is that this inference is wrong. Language model benchmarks test crystallized intelligence: the retrieval and recombination of knowledge that was already present in training data. The bar exam, MMLU, and coding benchmarks like HumanEval all have substantial overlap with content that models were trained on. The models are not reasoning about these problems from first principles. They are doing something closer to extraordinarily sophisticated pattern matching across a vast compressed archive of human text.

ARC-AGI strips that advantage away by design. Every puzzle is novel. There is no prior exposure to memorize, no pattern in the training set to interpolate from. The system must derive the rule from the examples shown in the prompt and apply it correctly to a new case. That is the operation that current language models are very bad at. It is also, according to most cognitive scientists, the operation that defines what humans mean when they use the word “intelligence.”

The o3 model partially closes this gap, but not through the same mechanism humans use. OpenAI's o3 achieves its ARC-AGI-1 scores through what researchers describe as natural language program search at test time: generating and evaluating many candidate reasoning chains through extended computation before committing to an answer. It is a different approach from prior models, and it works better on ARC tasks. But it requires orders of magnitude more compute than a human uses, and it still scores only 4 percent on ARC-AGI-2. Human cognition remains far more efficient at genuine novel reasoning than anything currently deployed.
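The general shape of test-time search can be sketched in miniature. This is an illustration of the idea of generate-verify-vote, not OpenAI's actual o3 mechanism, which has not been published in detail; the toy "rules" here are simple integer functions standing in for sampled reasoning chains.

```python
# Minimal sketch of test-time search: sample many candidate rules,
# keep only those consistent with every training pair, then take a
# majority vote among the survivors' answers.
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

def search(train_pairs, test_input, candidates, samples=200):
    votes = Counter()
    for _ in range(samples):
        rule = random.choice(candidates)            # stand-in for sampling a reasoning chain
        if all(rule(x) == y for x, y in train_pairs):  # verify against the examples
            votes[rule(test_input)] += 1            # only verified rules get a vote
    return votes.most_common(1)[0][0] if votes else None

# Toy integer rules; only doubling is consistent with the training pairs.
candidates = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(search([(2, 4), (5, 10)], 7, candidates))  # 14
```

Note what the sketch makes visible: correctness comes from brute sampling plus verification, so the compute bill scales with the number of candidates drawn, which is exactly the efficiency problem the ARC-AGI-2 results expose.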

What “Real” AI Intelligence Would Actually Look Like

Defining genuine machine intelligence is genuinely difficult, and researchers disagree on the details. But Chollet’s framework, grounded in information theory and cognitive science, offers a clear target: a system demonstrates intelligence when it can acquire new skills efficiently from minimal examples, generalize those skills to situations it has never encountered, and do so without requiring the amount of prior training data that effectively encodes the answer in advance.

By this standard, no current AI system fully qualifies. o3 comes closer than anything before it on ARC-AGI-1, but the mechanism it uses, brute-force search across reasoning chains, is not the same as the compressed, efficient generalization that human cognition performs. A person solving an ARC task spends seconds looking at the examples and identifies the rule almost immediately. That process happens with minimal compute, no prior exposure, and reliable transfer to the new case. The underlying cognitive operation remains largely unexplained by current AI architectures.

What would real AI intelligence look like in practice? Researchers point to a few markers. A system would need to generalize from very few examples, ideally one or two, to solve a problem class it has never seen before. It would need to do this at a cost comparable to human cognitive effort, not at $4,500 per task. It would need to perform consistently across multiple novel domains without requiring task-specific fine-tuning. And it would need to do it on a benchmark designed with the specific intention of ruling out training-data shortcuts, like ARC-AGI-2.

The broader scientific community is converging on a related concept: compositional generalization, the ability to combine learned primitives in new arrangements to solve problems that were never explicitly trained. Current transformer-based models, including every major model from OpenAI, Anthropic, and Google DeepMind, show limited compositional generalization on tasks specifically designed to test it. This is not a problem of scale. Adding more parameters or more training data does not reliably close the compositional gap. It is an architectural challenge that remains unsolved.
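Compositional generalization is easiest to see in a tiny concrete case. The grid primitives below are invented for illustration; the point is that two operations learned separately can be chained into a combination never seen during training.

```python
# Compositional generalization in miniature: two primitives learned
# separately, composed into a novel arrangement.
def flip_h(grid):
    """Mirror a grid left-to-right."""
    return [list(reversed(row)) for row in grid]

def flip_v(grid):
    """Mirror a grid top-to-bottom."""
    return list(reversed(grid))

# Chaining both primitives yields a 180-degree rotation -- a "new"
# skill that was never trained as such, only composed.
grid = [[1, 2], [3, 4]]
rotated = flip_v(flip_h(grid))
print(rotated)  # [[4, 3], [2, 1]]
```

Humans do this kind of recombination effortlessly; benchmarks designed to test it in models show that scaling alone does not reliably produce it.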

What This Means for AGI Timelines

The ARC-AGI results complicate both optimistic and pessimistic narratives about when artificial general intelligence might arrive. The optimistic read on o3’s 87.5 percent score on ARC-AGI-1 was that AGI was close, perhaps imminent. The subsequent release of ARC-AGI-2, where the best systems score in the low single digits against a 60 percent human average, corrected that reading sharply. The goalposts did not move arbitrarily. The ARC Prize Foundation released the harder version specifically because ARC-AGI-1 was being solved primarily through compute scaling rather than genuine reasoning advances.

That dynamic is important for understanding AI timelines. The benchmark is designed to be a moving target. As AI systems find ways to pass one version, a harder version is released that preserves the original property: tasks that are easy for humans and hard for AI. If AI progress on ARC continues to require exponentially more compute to gain each percentage point, while humans maintain near-perfect scores at near-zero cost, the gap in efficiency is not closing. It may be widening in the dimension that matters most.

This does not mean AGI is decades away in an absolute sense. It means the path to AGI requires solving a problem that current architectures are not solving: efficient novel generalization. Researchers at OpenAI, Anthropic, Google DeepMind, and academic institutions around the world are working on this directly, exploring architectures that incorporate more explicit reasoning modules, memory structures, and program synthesis components. Progress is real but not yet sufficient to close the ARC-AGI-2 gap.

What the ARC-AGI test ultimately reveals is that the question “is this AI intelligent?” requires specifying which kind of intelligence you mean. For crystallized intelligence, the kind measured by MMLU, bar exams, and knowledge-based benchmarks, AI is at or near human expert level right now. For fluid intelligence, the kind measured by novel problem-solving with minimal prior exposure, AI remains far from human performance. Both facts are true simultaneously. The challenge for the field is that fluid intelligence is, by most accounts, the harder and more important half of the equation.

Frequently Asked Questions About AI and Intelligence Tests

What is the ARC-AGI benchmark and why does it matter?

ARC-AGI is a benchmark created by Francois Chollet to measure fluid intelligence in AI systems. It presents visual grid puzzles where each task has a unique rule that must be inferred from a few examples with no prior training exposure. It matters because it specifically isolates novel reasoning ability from knowledge retrieval, exposing the gap that standard benchmarks miss.

Why do AI models that pass bar exams fail ARC-AGI tests?

Bar exams and knowledge benchmarks like MMLU test crystallized intelligence: stored knowledge from training data. ARC-AGI tests fluid intelligence: deriving new rules from minimal examples with no prior exposure. Large language models are extremely efficient at the first task and very poor at the second because their architecture is optimized for pattern retrieval, not novel rule generation.

What score did OpenAI o3 get on ARC-AGI?

OpenAI's o3 model scored 87.5 percent on ARC-AGI-1 in low-efficiency mode, approaching the human individual baseline of 85 percent or above. However, this required approximately $4,560 per task compared to roughly $5 per task for a human. On the harder ARC-AGI-2, released in 2025, o3-preview scored only 4 percent, while humans average 60 percent.

Does o3 scoring near human levels on ARC-AGI-1 mean AGI is here?

No. The ARC Prize Foundation released ARC-AGI-2 precisely because ARC-AGI-1 was being approached through compute scaling rather than genuine reasoning advances. On ARC-AGI-2, which humans solve at 60 percent average, the best AI systems score 4 percent or below. The efficiency gap, where humans use minimal compute and AI requires massive expenditure, also signals that the underlying capability is not equivalent.

What would true artificial general intelligence need to demonstrate?

According to Francois Chollet's framework, true AGI would need to demonstrate skill-acquisition efficiency: the ability to generalize to novel problems from very few examples, across multiple domains, at a computational cost comparable to human cognitive effort. It would need to pass a benchmark like ARC-AGI reliably, without requiring training-data shortcuts or disproportionate compute, while performing consistently on new task distributions it has never encountered.

The ARC-AGI test is not a trick or an artificially impossible standard. It is a carefully designed measure of the cognitive property that makes human intelligence generalizable: the capacity to encounter something genuinely new and figure it out. Current AI systems, for all their remarkable capabilities, have not yet cracked that problem at human efficiency levels. The field knows it, the benchmarks confirm it, and the research investment going into solving it is substantial. Whether the solution requires a new architecture, a new training paradigm, or a conceptual breakthrough that nobody has articulated yet, the ARC-AGI gap remains one of the most honest and important open questions in science.
