Related papers: BRIDGE: Predicting Human Task Completion Time From Model Performance

URL: http://arxiv.org/abs/2602.07267v1
Date: Fri, 06 Feb 2026 23:36:11 GMT
Title: BRIDGE: Predicting Human Task Completion Time From Model Performance
Authors: Fengyuan Liu, Jay Gala, Nilaksh, Dzmitry Bahdanau, Siva Reddy, Hugo Larochelle,
Abstract summary: Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. We propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time.
Score: 36.36759710005444
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
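The relationships named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the coefficients `alpha` and `beta` are hypothetical, and the linear link is written as difficulty = alpha + beta * log(minutes). Under a two-parameter logistic model the success probability is exactly 50% when ability equals difficulty, so the 50% task horizon follows by inverting that link.

```python
import math


def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability `theta` solves a task
    with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def half_horizon_minutes(theta, alpha, beta):
    """Human completion time of the task a model solves 50% of the time.

    Assumes the hypothetical linear link b = alpha + beta * log(minutes).
    At the 50% point theta == b, so minutes = exp((theta - alpha) / beta).
    """
    return math.exp((theta - alpha) / beta)


def horizon_after(months, start_minutes, doubling_months=6.0):
    """Extrapolate a task horizon that doubles every `doubling_months`."""
    return start_minutes * 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Illustrative values only; none of these numbers come from the paper.
    print(p_correct(theta=1.0, a=2.0, b=1.0))
    print(half_horizon_minutes(theta=2.0, alpha=0.0, beta=1.0))
    print(horizon_after(months=12, start_minutes=60.0))
```

With a six-month doubling time, a 60-minute horizon grows by a factor of four over twelve months.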

Related papers

Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning [49.82882141491629]
We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task algorithm for online learning.
arXiv Detail & Related papers (2026-03-02T05:07:43Z)
Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments [1.411614392022118]
Existing data-driven agent-based models struggle in low-data environments. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap.
arXiv Detail & Related papers (2026-01-20T20:58:17Z)
Error-driven Data-efficient Large Multimodal Model Tuning [35.20400815089843]
Large Multimodal Models (LMMs) have demonstrated impressive performance across numerous academic benchmarks. We propose an error-driven data-efficient tuning framework that aims to efficiently adapt generic LMMs to newly emerging tasks.
arXiv Detail & Related papers (2024-12-20T08:07:11Z)
Optimizing Locomotor Task Sets in Biological Joint Moment Estimation for Hip Exoskeleton Applications [0.0]
We introduce a locomotor task set optimization strategy to identify a minimal, yet representative, set of tasks that preserves model performance. Our results demonstrate the ability to maintain model accuracy while significantly reducing the cost associated with data collection and model training.
arXiv Detail & Related papers (2024-12-10T17:29:21Z)
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks [57.89516354418451]
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR). We employ a semi-automated task generation pipeline using Large Language Models (LLMs). We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution.
arXiv Detail & Related papers (2024-10-31T17:53:12Z)
Plots Unlock Time-Series Understanding in Multimodal Models [5.792074027074628]
This paper proposes a method that leverages the existing vision encoders of multimodal foundation models to "see" time-series data via plots. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks.
arXiv Detail & Related papers (2024-10-03T16:23:13Z)
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. We investigate whether we can go beyond human data on tasks where we have access to scalar feedback. We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users. We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks. We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.