ARC Prize 2024: Technical Report
- URL: http://arxiv.org/abs/2412.04604v2
- Date: Wed, 08 Jan 2025 05:24:50 GMT
- Title: ARC Prize 2024: Technical Report
- Authors: Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers,
- Abstract summary: As of December 2024, the ARC-AGI benchmark is five years old and remains unbeaten. This year, we launched ARC Prize, a global competition to inspire new ideas and drive open progress towards AGI. As a result, the state-of-the-art score on the ARC-AGI private evaluation set increased from 33% to 55.5%.
- Score: 0.036355666825174035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As of December 2024, the ARC-AGI benchmark is five years old and remains unbeaten. We believe it is currently the most important unsolved AI benchmark in the world because it seeks to measure generalization on novel tasks -- the essence of intelligence -- as opposed to skill at tasks that can be prepared for in advance. This year, we launched ARC Prize, a global competition to inspire new ideas and drive open progress towards AGI by reaching a target benchmark score of 85%. As a result, the state-of-the-art score on the ARC-AGI private evaluation set increased from 33% to 55.5%, propelled by several frontier AGI reasoning techniques including deep learning-guided program synthesis and test-time training. In this paper, we survey top approaches, review new open-source implementations, discuss the limitations of the ARC-AGI-1 dataset, and share key insights gained from the competition.
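The first of the two technique families named in the abstract, deep learning-guided program synthesis, can be made concrete with a toy example. The hedged sketch below enumerates compositions of grid primitives and keeps one consistent with a task's demonstration pairs; the primitives and task encoding are simplifying assumptions for illustration only, and competition entries use far richer DSLs plus a neural model to propose or rank candidate programs.

```python
# Minimal sketch of search-based program synthesis for ARC-style tasks.
# Grids are lists of rows of ints (colors). This is an illustrative toy,
# not any competitor's actual method.
from itertools import product

PRIMITIVES = {  # assumed illustrative primitives, not an official ARC DSL
    "identity":  lambda g: g,
    "flip_h":    lambda g: [row[::-1] for row in g],
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def synthesize(demos, max_depth=2):
    """Return the first primitive composition consistent with all demo pairs."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(g, _names=names):
                for name in _names:
                    g = PRIMITIVES[name](g)
                return g
            if all(program(x) == y for x, y in demos):
                return names, program
    return None, None

# Toy task whose hidden rule is a horizontal flip.
demos = [([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
         ([[5, 6], [7, 8]], [[6, 5], [8, 7]])]
names, program = synthesize(demos)
print(names)                       # ('flip_h',)
print(program([[9, 4], [4, 9]]))   # apply the found program to a test input
```

In a deep learning-guided variant, the brute-force enumeration above is replaced or reordered by a model that scores which primitives are likely to appear, shrinking the search space.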
Related papers
- ARC Prize 2025: Technical Report [0.45671221781968335]
The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks. The ARC Prize 2025 global competition targeted the newly released ARC-AGI-2 dataset. The defining theme of 2025 is the emergence of the refinement loop (sketched below).
arXiv Detail & Related papers (2026-01-15T23:23:56Z)
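As a rough illustration of the refinement loop mentioned above, the sketch below drafts a candidate, checks it against a verifier, and revises it using failure feedback. The propose and revise functions are stand-ins (assumptions); in 2025-era ARC systems they would typically be LLM calls, with verification by executing candidates on a task's demonstration pairs.

```python
# Minimal sketch of a "refinement loop": draft, critique, revise until a
# candidate passes verification or the budget runs out.

def verify(candidate, demos):
    """Accept a candidate iff it reproduces every demonstration pair."""
    return all(candidate(x) == y for x, y in demos)

def refinement_loop(propose, revise, demos, max_iters=5):
    candidate = propose()                     # initial draft
    for _ in range(max_iters):
        if verify(candidate, demos):
            return candidate                  # verified solution
        failures = [(x, candidate(x), y)      # feedback for the reviser
                    for x, y in demos if candidate(x) != y]
        candidate = revise(candidate, failures)
    return None                               # budget exhausted

# Toy usage: the hidden rule is "double the input". The scripted drafts
# below stand in for successive model outputs.
demos = [(2, 4), (5, 10)]
drafts = iter([lambda x: x + 2, lambda x: x * 2])
solution = refinement_loop(propose=lambda: next(drafts),
                           revise=lambda c, fb: next(drafts),
                           demos=demos)
print(solution(7))  # -> 14
```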
- Trajectories and Comparative Analysis of Global Countries Dominating AI Publications, 2000-2025 [0.0]
The US and the European Union (EU27), once the undisputed and established leaders, have experienced a notable decline in relative dominance. China has undergone a dramatic ascent, expanding its global share of AI publications from under 5% in 2000 to nearly 36% by 2025. These empirical findings highlight the strategic implications of concentrated research output.
arXiv Detail & Related papers (2025-09-29T16:35:54Z)
- AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results [157.40117175150962]
This paper reviews the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB.
arXiv Detail & Related papers (2025-08-19T03:18:22Z)
- Don't throw the baby out with the bathwater: How and why deep learning for ARC [0.0]
The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. We propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We are the first to propose and show that deep learning can be used effectively for ARC, with boosts of up to 260% in accuracy from AIRV and a further 300% boost from TTFT (test-time fine-tuning; a minimal sketch follows below).
arXiv Detail & Related papers (2025-06-17T07:40:39Z)
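Test-time fine-tuning adapts a model to each task at inference time. Below is a minimal, hedged sketch of the general idea using PyTorch: run a few gradient steps on the current task's demonstration pairs before predicting its test input. The tiny model, grid encoding, and hyperparameters are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of test-time fine-tuning (TTFT) on toy 2x2 ARC-style grids.
import torch
import torch.nn as nn

NUM_COLORS, CELLS = 10, 4  # assumed toy sizes for illustration

model = nn.Sequential(  # stand-in for a pretrained ARC model
    nn.Linear(CELLS * NUM_COLORS, 64), nn.ReLU(),
    nn.Linear(64, CELLS * NUM_COLORS),
)

def encode(grid):
    """Flatten a grid of color ids into a one-hot feature vector."""
    return torch.nn.functional.one_hot(
        torch.tensor(grid).flatten(), NUM_COLORS).float().flatten()

def test_time_finetune(model, demos, steps=50, lr=1e-2):
    """Overfit the model to this one task's demonstration pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    xs = torch.stack([encode(x) for x, _ in demos])
    ys = torch.stack([torch.tensor(y).flatten() for _, y in demos])
    for _ in range(steps):
        logits = model(xs).view(len(demos), CELLS, NUM_COLORS)
        loss = nn.functional.cross_entropy(logits.transpose(1, 2), ys)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

demos = [([[1, 0], [2, 3]], [[0, 1], [3, 2]]),   # hidden rule: flip_h
         ([[5, 6], [7, 8]], [[6, 5], [8, 7]])]
model = test_time_finetune(model, demos)
pred = model(encode([[9, 4], [4, 9]])).view(CELLS, NUM_COLORS).argmax(-1)
print(pred.view(2, 2))  # the adapted model's guess for the test input
```

Real TTFT pipelines start from a genuinely pretrained model, augment the handful of demonstrations (e.g. by rotations and color permutations), and fine-tune with adapters rather than full weights; the loop above only shows the core per-task adaptation step.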
- ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems [0.03431023404301193]
ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to assess abstract reasoning and problem-solving abilities. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
arXiv Detail & Related papers (2025-05-17T04:34:48Z)
- The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report [170.81876816944754]
The NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR) aims to advance the development of models that optimize key computational metrics. This paper meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques.
arXiv Detail & Related papers (2025-04-14T20:18:21Z)
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [118.8024915014751]
Large language models (LLMs) have demonstrated remarkable proficiency in academic disciplines such as mathematics, physics, and computer science.
However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks.
We present SuperGPQA, a benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
arXiv Detail & Related papers (2025-02-20T17:05:58Z)
- Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI [0.0]
OpenAI's o3 achieves a high score of 87.5% on ARC-AGI, a benchmark proposed to measure intelligence.
This raises the question of whether systems based on Large Language Models (LLMs), particularly o3, demonstrate intelligence and progress towards artificial general intelligence (AGI).
arXiv Detail & Related papers (2025-01-13T16:28:01Z)
- Neuro-Symbolic AI in 2024: A Systematic Review [0.29260385019352086]
The review followed the PRISMA methodology, utilizing databases such as IEEE Xplore, Google Scholar, arXiv, ACM, and SpringerLink.
From an initial pool of 1,428 papers, 167 met the inclusion criteria and were analyzed in detail.
The majority of research efforts are concentrated in the areas of learning and inference, logic and reasoning, and knowledge representation.
arXiv Detail & Related papers (2025-01-09T18:48:35Z)
- AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities [0.3428444467046466]
We tasked 16 state-of-the-art large language models with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030.
To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR).
arXiv Detail & Related papers (2024-12-12T15:52:41Z)
- H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark [7.840781070208872]
Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods.
Previous work explored how well humans can solve tasks from the ARC benchmark.
We obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks.
arXiv Detail & Related papers (2024-09-02T17:11:32Z)
- CRAG -- Comprehensive RAG Benchmark [58.15980697921195]
Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs') lack of knowledge.
Existing RAG datasets do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks.
To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG).
CRAG is a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search; a minimal sketch of the RAG pattern follows below.
arXiv Detail & Related papers (2024-06-07T08:43:07Z)
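For context, the sketch below shows the bare RAG pattern that CRAG-style benchmarks evaluate: retrieve evidence for a question, then condition generation on it. The toy corpus, overlap-based retriever, and generator stub are all assumptions for illustration; CRAG itself provides mock web and KG search APIs in place of this retriever.

```python
# Minimal sketch of the retrieve-then-generate RAG pattern.

def retrieve(question, corpus, k=2):
    """Rank documents by naive word overlap with the question."""
    q = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(question, corpus, generate):
    evidence = retrieve(question, corpus)
    prompt = "Context:\n" + "\n".join(evidence) + f"\nQuestion: {question}\n"
    return generate(prompt)  # in practice, an LLM call

corpus = [
    "ARC Prize 2024 raised the private evaluation state of the art to 55.5%.",
    "CRAG contains 4,409 question-answer pairs with mock search APIs.",
    "PU21-PSNR is a perceptually uniform metric for HDR reconstruction.",
]
stub = lambda prompt: prompt.splitlines()[1]  # echo top evidence as "answer"
print(answer("How many question-answer pairs does CRAG contain?", corpus, stub))
```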
- PoCo: Point Context Cluster for RGBD Indoor Place Recognition [47.12179061883084]
We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database.
We propose a new network architecture, which generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from the noisy point clouds through end-to-end learning.
arXiv Detail & Related papers (2024-04-03T17:38:15Z)
- ICDAR 2023 Competition on Hierarchical Text Detection and Recognition [60.68100769639923]
The competition aims to promote research into deep learning models and systems that can jointly perform text detection and recognition.
We present details of the proposed competition organization, including tasks, datasets, evaluations, and schedule.
During the competition period (from January 2nd 2023 to April 1st 2023), at least 50 submissions from more than 20 teams were made across the two proposed tasks.
arXiv Detail & Related papers (2023-05-16T18:56:12Z)
- A Review for Deep Reinforcement Learning in Atari: Benchmarks, Challenges, and Solutions [0.0]
The Arcade Learning Environment (ALE) is proposed as an evaluation platform for empirically assessing the generality of agents across Atari 2600 games.
From Deep Q-Networks (DQN) to Agent57, RL agents seem to achieve superhuman performance in ALE.
We propose a novel Atari benchmark based on human world records (HWR), which demands more of RL agents in terms of both final performance and learning efficiency; a minimal normalization sketch follows below.
arXiv Detail & Related papers (2021-12-08T06:52:23Z)
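One natural way to express the HWR-based scoring idea is to normalize a raw game score so that random play maps to 0.0 and the human world record to 1.0. This follows the common human-normalized-score convention and is an assumption about the exact formula; the example numbers are made up.

```python
# Minimal sketch of human-world-record (HWR) normalization for Atari scores:
# 0.0 means random play, 1.0 means the human world record. The formula and
# the numbers below are illustrative assumptions, not the paper's data.

def hwr_normalized(agent, random_baseline, world_record):
    """Scale a raw game score so random = 0.0 and the HWR = 1.0."""
    return (agent - random_baseline) / (world_record - random_baseline)

# Hypothetical per-game numbers (not real records or agent results).
score = hwr_normalized(agent=120_000, random_baseline=200, world_record=1_000_000)
print(f"{score:.3f}")  # ~0.120: far from the record despite a high raw score
```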
- Benchmarking Graph Neural Networks [75.42159546060509]
Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs.
For any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress.
The project's GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework.
arXiv Detail & Related papers (2020-03-02T15:58:46Z)
- Recognizing Families In the Wild: White Paper for the 4th Edition Data Challenge [91.55319616114943]
This paper summarizes the supported tasks (i.e., kinship verification, tri-subject verification, and search & retrieval of missing children) in the Recognizing Families In the Wild (RFIW) evaluation.
The purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts in promising future directions.
arXiv Detail & Related papers (2020-02-15T02:22:42Z)
- Analysing Affective Behavior in the First ABAW 2020 Competition [49.90617840789334]
The Affective Behavior Analysis in-the-wild (ABAW) 2020 Competition is the first competition aiming at automatic analysis of the three main behavior tasks: valence-arousal estimation, basic expression recognition, and action unit detection.
We describe this competition, to be held in conjunction with the IEEE Conference on Face and Gesture Recognition, May 2020, in Buenos Aires, Argentina.
We outline the evaluation metrics, present the baseline system and the top-3 performing teams' methodologies per challenge, and finally present their obtained results.
arXiv Detail & Related papers (2020-01-30T15:41:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of any of the information presented and is not responsible for any consequences of its use.