H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
- URL: http://arxiv.org/abs/2409.01374v1
- Date: Mon, 2 Sep 2024 17:11:32 GMT
- Title: H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
- Authors: Solim LeGris, Wai Keen Vong, Brenden M. Lake, Todd M. Gureckis
- Abstract summary: Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods.
Previous work explored how well humans can solve tasks from the ARC benchmark.
We obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks.
- Score: 7.840781070208872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Abstraction and Reasoning Corpus (ARC) is a visual program synthesis benchmark designed to test challenging out-of-distribution generalization in humans and machines. Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods. Comparing human and machine performance is important for the validity of the benchmark. While previous work explored how well humans can solve tasks from the ARC benchmark, they either did so using only a subset of tasks from the original dataset, or from variants of ARC, and therefore only provided a tentative estimate of human performance. In this work, we obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks from the original ARC problem set. We estimate that average human performance lies between 73.3% and 77.2% correct with a reported empirical average of 76.2% on the training set, and between 55.9% and 68.9% correct with a reported empirical average of 64.2% on the public evaluation set. However, we also find that 790 out of the 800 tasks were solvable by at least one person in three attempts, suggesting that the vast majority of the publicly available ARC tasks are in principle solvable by typical crowd-workers recruited over the internet. Notably, while these numbers are slightly lower than earlier estimates, human performance still greatly exceeds current state-of-the-art approaches for solving ARC. To facilitate research on ARC, we publicly release our dataset, called H-ARC (human-ARC), which includes all of the submissions and action traces from human participants.
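The abstract reports both empirical averages and interval estimates for human accuracy (e.g., 73.3%-77.2% on the training set). As an illustration only, and not the authors' analysis code, the minimal Python sketch below shows one common way such an interval can be obtained: bootstrap resampling over per-participant solve rates. The column names and synthetic data are hypothetical; the released H-ARC dataset and the paper's actual estimation procedure may differ.

```python
# Hypothetical sketch: bootstrap 95% CI for average human accuracy,
# assuming an H-ARC-style table with columns "participant_id" and
# "solved" (1 if the task was solved within three attempts).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data; replace with the released H-ARC records.
df = pd.DataFrame({
    "participant_id": rng.integers(0, 100, size=2000),
    "solved": rng.random(2000) < 0.75,
})

# Per-participant accuracy, then resample participants with replacement.
per_participant = df.groupby("participant_id")["solved"].mean().to_numpy()
boot_means = np.array([
    rng.choice(per_participant, size=per_participant.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"empirical mean: {per_participant.mean():.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```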
Related papers
- PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks [57.89516354418451]
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR).
We employ a semi-automated task generation pipeline using Large Language Models (LLMs).
We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution.
arXiv Detail & Related papers (2024-10-31T17:53:12Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- Towards Automation of Human Stage of Decay Identification: An Artificial Intelligence Approach [3.2048813174244795]
This study explores the feasibility of automating two common human decomposition scoring methods using artificial intelligence (AI).
We evaluated two popular deep learning models, Inception V3 and Xception, by training them on a large dataset of human decomposition images.
The Xception model achieved the best classification performance, with macro-averaged F1 scores of .878, .881, and .702 for the head, torso, and limbs.
arXiv Detail & Related papers (2024-08-19T21:00:40Z)
- Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation [0.9332308328407303]
Estimating conditional average dose responses (CADR) is an important but challenging problem.
Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance.
We propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance.
arXiv Detail & Related papers (2024-06-12T13:39:32Z)
- Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling [50.08315607506652]
We propose a Constrained Active Sampling Framework (CASF) for reliable human judgment.
Experiment results show CASF receives 93.18% top-ranked system recognition accuracy.
arXiv Detail & Related papers (2024-06-12T07:44:36Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address the limitations of reward models (RMs) by empowering them with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP).
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines by direct evaluation.
arXiv Detail & Related papers (2023-03-06T07:10:07Z)
- Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow [22.540665278228975]
We propose VAIDA, a novel benchmark creation paradigm for NLP.
VAIDA focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies.
We find that VAIDA decreases the effort, frustration, mental demand, and temporal demand of crowdworkers and analysts.
arXiv Detail & Related papers (2023-02-09T04:43:10Z)
- Bottom-Up 2D Pose Estimation via Dual Anatomical Centers for Small-Scale Persons [75.86463396561744]
In multi-person 2D pose estimation, the bottom-up methods simultaneously predict poses for all persons.
Our method achieves a 38.4% improvement in bounding box precision and a 39.1% improvement in bounding box recall over the state of the art (SOTA).
For the human pose AP evaluation, we achieve a new SOTA (71.0 AP) on the COCO test-dev set with the single-scale testing.
arXiv Detail & Related papers (2022-08-25T10:09:10Z)
- A Review for Deep Reinforcement Learning in Atari: Benchmarks, Challenges, and Solutions [0.0]
The Arcade Learning Environment (ALE) is proposed as an evaluation platform for empirically assessing the generality of agents across Atari 2600 games.
From Deep Q-Networks (DQN) to Agent57, RL agents seem to achieve superhuman performance in ALE.
We propose a novel Atari benchmark based on human world records (HWR), which puts forward higher requirements for RL agents on both final performance and learning efficiency.
arXiv Detail & Related papers (2021-12-08T06:52:23Z)
- The effectiveness of feature attribution methods and its correlation with automatic evaluation scores [19.71360639210631]
We conduct the first, large-scale user study on 320 lay and 11 expert users to shed light on the effectiveness of state-of-the-art attribution methods.
We found that, overall, feature attribution is surprisingly not more effective than showing humans the nearest training-set examples.
arXiv Detail & Related papers (2021-05-31T13:23:50Z)