ARC Prize 2025: Technical Report
- URL: http://arxiv.org/abs/2601.10904v1
- Date: Thu, 15 Jan 2026 23:23:56 GMT
- Title: ARC Prize 2025: Technical Report
- Authors: François Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers,
- Abstract summary: The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks. The ARC Prize 2025 global competition targeted the newly released ARC-AGI-2 dataset. The defining theme of 2025 is the emergence of the refinement loop.
- Score: 0.45671221781968335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks, a core aspect of intelligence. The ARC Prize 2025 global competition targeted the newly released ARC-AGI-2 dataset, which features greater task complexity compared to its predecessor. The Kaggle competition attracted 1,455 teams and 15,154 entries, with the top score reaching 24% on the ARC-AGI-2 private evaluation set. Paper submissions nearly doubled year-over-year to 90 entries, reflecting the growing research interest in fluid intelligence and abstract reasoning. The defining theme of 2025 is the emergence of the refinement loop -- a per-task iterative program optimization loop guided by a feedback signal. Refinement loops come in a variety of forms, in particular evolutionary program synthesis approaches and application-layer refinements to commercial AI systems. Such refinement loops are also possible in weight space, as evidenced by zero-pretraining deep learning methods that now achieve competitive performance with remarkably small networks (7M parameters). In parallel, four frontier AI labs (Anthropic, Google DeepMind, OpenAI, and xAI) reported ARC-AGI performance in public model cards in 2025, establishing ARC-AGI as an industry-standard benchmark for AI reasoning. However, our analysis indicates that current frontier AI reasoning performance remains fundamentally constrained by knowledge coverage, giving rise to new forms of benchmark contamination. In this paper, we survey the top-performing methods, examine the role of refinement loops in AGI progress, discuss knowledge-dependent overfitting, and preview ARC-AGI-3, which introduces interactive reasoning challenges that require exploration, planning, memory, goal acquisition, and alignment capabilities.
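The refinement loop the abstract describes is a simple per-task search: propose candidate programs, score them against the task's demonstration pairs with a feedback signal, keep the best, and mutate. Below is a minimal sketch in Python of the evolutionary program-synthesis variant; all names (`Grid`, `Program`, `fitness`, `refine`, `mutate`) are hypothetical illustrations of the pattern, not any competitor's actual API.

```python
# Minimal sketch of a per-task refinement loop in the evolutionary
# program-synthesis style. Hypothetical illustration only: the caller
# supplies the seed candidates and a `mutate` operator (e.g., LLM-proposed
# edits or grammar-based mutations of a DSL program).
import random
from typing import Callable, List, Tuple

Grid = List[List[int]]            # an ARC grid: rows of color indices
Program = Callable[[Grid], Grid]  # a candidate solution program

def fitness(program: Program, train_pairs: List[Tuple[Grid, Grid]]) -> float:
    """Feedback signal: fraction of demonstration pairs the program solves."""
    solved = 0
    for grid_in, grid_out in train_pairs:
        try:
            if program(grid_in) == grid_out:
                solved += 1
        except Exception:
            pass  # crashing candidates simply score zero on this pair
    return solved / len(train_pairs)

def refine(seeds: List[Program],
           mutate: Callable[[Program], Program],
           train_pairs: List[Tuple[Grid, Grid]],
           generations: int = 50,
           population: int = 32) -> Program:
    """Per-task loop: score candidates, keep an elite, mutate, repeat."""
    pool = list(seeds)
    for _ in range(generations):
        pool.sort(key=lambda p: fitness(p, train_pairs), reverse=True)
        elite = pool[: max(1, population // 4)]
        if fitness(elite[0], train_pairs) == 1.0:
            return elite[0]  # all demonstration pairs solved; stop early
        pool = elite + [mutate(random.choice(elite))
                        for _ in range(population - len(elite))]
    return max(pool, key=lambda p: fitness(p, train_pairs))
```

The same loop shape covers the other forms the abstract mentions: application-layer refinement replaces `mutate` with a call to a commercial model that edits its own previous attempt, and weight-space refinement replaces it with gradient updates to a small task-specific network.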
Related papers
- Step-DeepResearch Technical Report [90.50586290399683]
We introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing. To bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios.
arXiv Detail & Related papers (2025-12-23T16:32:27Z)
- Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems (ADRS). We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
arXiv Detail & Related papers (2025-12-16T18:51:23Z)
- ARC-GEN: A Mimetic Procedural Benchmark Generator for the Abstraction and Reasoning Corpus [3.553493344868413]
This paper introduces ARC-GEN, an open-source procedural generator aimed at extending the original ARC-AGI training dataset. Unlike prior efforts, our generator is both exhaustive (covering all four hundred tasks) and mimetic. We also discuss the use of this generator in establishing a static benchmark suite to verify the correctness of programs submitted to the 2025 Google Code Golf Championship.
arXiv Detail & Related papers (2025-10-31T18:10:05Z)
- Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models [72.52332895840279]
GenCluster is a test-time compute framework that attains IOI gold-level performance using open-weight models. We show that GenCluster achieves a gold medal at IOI 2025 for the first time with an open-weight model.
arXiv Detail & Related papers (2025-10-16T02:19:25Z)
- ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems [0.03431023404301193]
ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to assess abstract reasoning and problem-solving abilities. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
arXiv Detail & Related papers (2025-05-17T04:34:48Z)
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- Competitive Programming with Large Reasoning Models [73.7455809592467]
We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. We compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inferences.
arXiv Detail & Related papers (2025-02-03T23:00:15Z)
- Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI [0.0]
OpenAI's o3 achieves a high score of 87.5% on ARC-AGI, a benchmark proposed to measure intelligence. This raises the question of whether systems based on Large Language Models (LLMs), particularly o3, demonstrate intelligence and progress towards artificial general intelligence (AGI).
arXiv Detail & Related papers (2025-01-13T16:28:01Z)
- ARC Prize 2024: Technical Report [0.036355666825174035]
As of December 2024, the ARC-AGI benchmark is five years old and remains unbeaten. This year, we launched ARC Prize, a global competition to inspire new ideas and drive open progress towards AGI. As a result, the state-of-the-art score on the ARC-AGI private evaluation set increased from 33% to 55.5%.
arXiv Detail & Related papers (2024-12-05T20:40:28Z)
- How Far Are We From AGI: Are LLMs All We Need? [15.705756259264932]
AGI is distinguished by its ability to execute diverse real-world tasks with efficiency and effectiveness comparable to human intelligence.
This paper outlines the requisite capability frameworks for AGI, integrating the internal, interface, and system dimensions.
To give tangible insights into the ubiquitous impact of the integration of AI, we outline existing challenges and potential pathways toward AGI in multiple domains.
arXiv Detail & Related papers (2024-05-16T17:59:02Z)
- OpenAGI: When LLM Meets Domain Experts [51.86179657467822]
Human Intelligence (HI) excels at combining basic skills to solve complex tasks.
This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents.
We introduce OpenAGI, an open-source platform designed for solving multi-step, real-world tasks.
arXiv Detail & Related papers (2023-04-10T03:55:35Z)