ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
- URL: http://arxiv.org/abs/2505.11831v1
- Date: Sat, 17 May 2025 04:34:48 GMT
- Title: ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
- Authors: Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, Henry Pinkard,
- Abstract summary: ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to assess abstract reasoning and problem-solving abilities. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark's accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
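The "input-output pair task format" referenced in the abstract is, in the public ARC-AGI releases, a JSON object with `train` demonstration pairs and `test` pairs, where each grid is a 2D list of integers 0-9 denoting colors. A minimal sketch of parsing and inspecting such a task (the toy grids below are illustrative, not taken from the benchmark):

```python
import json

# A toy ARC-style task. Each pair maps an input grid to an output grid;
# grids are rectangular 2D lists of integers 0-9 (color codes).
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2], [0, 2]], "output": [[0, 0], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]]}
  ]
}
"""

task = json.loads(task_json)

def grid_shape(grid):
    """Return (rows, cols) of a rectangular grid."""
    return len(grid), len(grid[0])

# A solver sees the train pairs and must predict outputs for the test inputs.
for i, pair in enumerate(task["train"]):
    print(f"train pair {i}: input {grid_shape(pair['input'])} "
          f"-> output {grid_shape(pair['output'])}")
```

Solvers receive only a handful of such demonstration pairs per task, which is what makes the benchmark a test of fluid intelligence rather than of memorized skills.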
Related papers
- ARC-NCA: Towards Developmental Solutions to the Abstraction and Reasoning Corpus
ARC-NCA is a developmental approach to tackling the ARC-AGI benchmark. Developmental solutions may offer a promising avenue for enhancing AI's problem-solving capabilities.
arXiv Detail & Related papers (2025-05-13T17:55:43Z)
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI
OpenAI's o3 achieves a high score of 87.5% on ARC-AGI, a benchmark proposed to measure intelligence. This raises the question of whether systems based on Large Language Models (LLMs), particularly o3, demonstrate intelligence and progress towards artificial general intelligence (AGI).
arXiv Detail & Related papers (2025-01-13T16:28:01Z)
- ML Research Benchmark
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks.
This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o.
The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z)
- A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
A-Bench is a benchmark designed to diagnose whether large multi-modal models (LMMs) are masters at evaluating AI-generated images (AIGIs). Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs.
arXiv Detail & Related papers (2024-06-05T08:55:02Z)
- How Far Are We From AGI: Are LLMs All We Need?
AGI is distinguished by its ability to execute diverse real-world tasks with efficiency and effectiveness comparable to human intelligence.
This paper outlines the requisite capability frameworks for AGI, integrating the internal, interface, and system dimensions.
To give tangible insights into the ubiquitous impact of the integration of AI, we outline existing challenges and potential pathways toward AGI in multiple domains.
arXiv Detail & Related papers (2024-05-16T17:59:02Z)
- Generative AI Agent for Next-Generation MIMO Design: Fundamentals, Challenges, and Vision
Next-generation multiple input multiple output (MIMO) is expected to be intelligent and scalable.
We propose the concept of the generative AI agent, which is capable of generating tailored and specialized contents.
We present two compelling case studies that demonstrate the effectiveness of leveraging the generative AI agent for performance analysis.
arXiv Detail & Related papers (2024-04-13T02:39:36Z)
- Levels of AGI for Operationalizing Progress on the Path to AGI
We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors.
This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI.
arXiv Detail & Related papers (2023-11-04T17:44:58Z)
- Exploration with Principles for Diverse AI Supervision
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z)
- The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
We describe an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC).
In particular, we describe ConceptARC, a new, publicly available benchmark in the ARC domain.
We report results on testing humans on this benchmark as well as three machine solvers.
arXiv Detail & Related papers (2023-05-11T21:06:39Z)
- OpenAGI: When LLM Meets Domain Experts
Human Intelligence (HI) excels at combining basic skills to solve complex tasks.
This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents.
We introduce OpenAGI, an open-source platform designed for solving multi-step, real-world tasks.
arXiv Detail & Related papers (2023-04-10T03:55:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.