H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
- URL: http://arxiv.org/abs/2409.01374v1
- Date: Mon, 2 Sep 2024 17:11:32 GMT
- Title: H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
- Authors: Solim LeGris, Wai Keen Vong, Brenden M. Lake, Todd M. Gureckis
- Abstract summary: Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods.
Previous work explored how well humans can solve tasks from the ARC benchmark.
We obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks.
- Score: 7.840781070208872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Abstraction and Reasoning Corpus (ARC) is a visual program synthesis benchmark designed to test challenging out-of-distribution generalization in humans and machines. Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods. Comparing human and machine performance is important for the validity of the benchmark. While previous work explored how well humans can solve tasks from the ARC benchmark, they either did so using only a subset of tasks from the original dataset, or from variants of ARC, and therefore only provided a tentative estimate of human performance. In this work, we obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks from the original ARC problem set. We estimate that average human performance lies between 73.3% and 77.2% correct with a reported empirical average of 76.2% on the training set, and between 55.9% and 68.9% correct with a reported empirical average of 64.2% on the public evaluation set. However, we also find that 790 out of the 800 tasks were solvable by at least one person in three attempts, suggesting that the vast majority of the publicly available ARC tasks are in principle solvable by typical crowd-workers recruited over the internet. Notably, while these numbers are slightly lower than earlier estimates, human performance still greatly exceeds current state-of-the-art approaches for solving ARC. To facilitate research on ARC, we publicly release our dataset, called H-ARC (human-ARC), which includes all of the submissions and action traces from human participants.
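The abstract reports both empirical averages and interval estimates for human accuracy (e.g., 73.3%-77.2% on the training set). As an illustration only, and not the authors' analysis code, the minimal Python sketch below shows one common way such an interval can be obtained: bootstrap resampling over per-participant solve rates. The column names and synthetic data are hypothetical; the released H-ARC dataset and the paper's actual estimation procedure may differ.

```python
# Hypothetical sketch: bootstrap 95% CI for average human accuracy,
# assuming an H-ARC-style table with columns "participant_id" and
# "solved" (1 if the task was solved within three attempts).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data; replace with the released H-ARC records.
df = pd.DataFrame({
    "participant_id": rng.integers(0, 100, size=2000),
    "solved": rng.random(2000) < 0.75,
})

# Per-participant accuracy, then resample participants with replacement.
per_participant = df.groupby("participant_id")["solved"].mean().to_numpy()
boot_means = np.array([
    rng.choice(per_participant, size=per_participant.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"empirical mean: {per_participant.mean():.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```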
Related papers
- PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks [57.89516354418451]
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR).
We employ a semi-automated task generation pipeline using Large Language Models (LLMs).
We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution.
arXiv Detail & Related papers (2024-10-31T17:53:12Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- Towards Automation of Human Stage of Decay Identification: An Artificial Intelligence Approach [3.2048813174244795]
This study explores the feasibility of automating two common human decomposition scoring methods using artificial intelligence (AI).
We evaluated two popular deep learning models, Inception V3 and Xception, by training them on a large dataset of human decomposition images.
The Xception model achieved the best classification performance, with macro-averaged F1 scores of .878, .881, and .702 for the head, torso, and limbs.
arXiv Detail & Related papers (2024-08-19T21:00:40Z)
- Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation [0.9332308328407303]
Estimating conditional average dose responses (CADR) is an important but challenging problem.
Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance.
We propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance.
arXiv Detail & Related papers (2024-06-12T13:39:32Z)
- Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling [50.08315607506652]
We propose a Constrained Active Sampling Framework (CASF) for reliable human judgment.
Experiment results show CASF receives 93.18% top-ranked system recognition accuracy.
arXiv Detail & Related papers (2024-06-12T07:44:36Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address the limitations of reward models (RMs) by empowering them with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP).
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines by direct evaluation.
arXiv Detail & Related papers (2023-03-06T07:10:07Z)
- Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow [22.540665278228975]
We propose VAIDA, a novel benchmark creation paradigm for NLP.
VAIDA focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies.
We find that VAIDA decreases the effort, frustration, mental demand, and temporal demand of crowdworkers and analysts.
arXiv Detail & Related papers (2023-02-09T04:43:10Z)
- Bottom-Up 2D Pose Estimation via Dual Anatomical Centers for Small-Scale Persons [75.86463396561744]
In multi-person 2D pose estimation, the bottom-up methods simultaneously predict poses for all persons.
Our method achieves a 38.4% improvement in bounding box precision and a 39.1% improvement in bounding box recall over the state of the art (SOTA).
For the human pose AP evaluation, we achieve a new SOTA (71.0 AP) on the COCO test-dev set with the single-scale testing.
arXiv Detail & Related papers (2022-08-25T10:09:10Z)
- A Review for Deep Reinforcement Learning in Atari: Benchmarks, Challenges, and Solutions [0.0]
The Arcade Learning Environment (ALE) is proposed as an evaluation platform for empirically assessing the generality of agents across Atari 2600 games.
From Deep Q-Networks (DQN) to Agent57, RL agents seem to achieve superhuman performance in ALE.
We propose a novel Atari benchmark based on human world records (HWR), which puts forward higher requirements for RL agents on both final performance and learning efficiency.
arXiv Detail & Related papers (2021-12-08T06:52:23Z)
- The effectiveness of feature attribution methods and its correlation with automatic evaluation scores [19.71360639210631]
We conduct the first, large-scale user study on 320 lay and 11 expert users to shed light on the effectiveness of state-of-the-art attribution methods.
We found that, overall, feature attribution is surprisingly not more effective than showing humans the nearest training-set examples.
arXiv Detail & Related papers (2021-05-31T13:23:50Z)