A Rosetta Stone for AI Benchmarks
- URL: http://arxiv.org/abs/2512.00193v1
- Date: Fri, 28 Nov 2025 20:18:58 GMT
- Title: A Rosetta Stone for AI Benchmarks
- Authors: Anson Ho, Jean-Stanislas Denain, David Atanasov, Samuel Albanie, Rohin Shah,
- Abstract summary: Most AI benchmarks saturate within years or even months after they are introduced, making it hard to study long-run trends in AI capabilities. We build a statistical framework that stitches benchmarks together, putting model capabilities and benchmark difficulties on a single numerical scale. This acts as a "Rosetta Stone", allowing us to compare models across a wide range of abilities and time, even if they are not evaluated on the same benchmarks.
- Score: 28.690200241767897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most AI benchmarks saturate within years or even months after they are introduced, making it hard to study long-run trends in AI capabilities. To address this challenge, we build a statistical framework that stitches benchmarks together, putting model capabilities and benchmark difficulties on a single numerical scale. This acts as a "Rosetta Stone", allowing us to compare models across a wide range of abilities and time, even if they are not evaluated on the same benchmarks. Moreover, this works without assuming how capabilities evolve across time or with training compute. We demonstrate three applications of this framework. First, we use it to measure the speed of AI progress over time, and to forecast future AI capabilities. Second, we estimate the rate of improvements in algorithmic efficiency, finding estimates that are higher, but broadly consistent with prior work. Finally, we find that our approach can be used to detect rapid accelerations in AI progress.
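The abstract does not spell out the underlying model, but the idea of putting model capabilities and benchmark difficulties on one numerical scale can be sketched with an item-response-theory-style logistic link. The score matrix, parameter names, and least-squares fit below are illustrative assumptions, not the paper's actual specification.

```python
# Minimal sketch (not the paper's actual model): place model abilities and
# benchmark difficulties on one latent scale via an IRT-style logistic link,
#   score(model i, benchmark j) ~ sigmoid(ability_i - difficulty_j),
# fit only on the observed cells of a sparse score matrix.
import numpy as np
from scipy.optimize import minimize

# Hypothetical accuracies in [0, 1]; NaN marks "model never ran this benchmark".
scores = np.array([
    [0.92, 0.55, np.nan],   # older model, never evaluated on the newest benchmark
    [0.98, 0.80, 0.25],     # mid-generation model, evaluated on all three
    [np.nan, 0.96, 0.70],   # frontier model, oldest benchmark already retired
])
n_models, n_benchmarks = scores.shape
observed = ~np.isnan(scores)

def loss(params):
    ability, difficulty = params[:n_models], params[n_models:]
    pred = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    return 0.5 * np.sum(((scores - pred)[observed]) ** 2)

fit = minimize(loss, np.zeros(n_models + n_benchmarks), method="L-BFGS-B")
shift = fit.x[:n_models].mean()    # the latent scale has a free origin; pin it
ability, difficulty = fit.x[:n_models] - shift, fit.x[n_models:] - shift
print("model abilities:       ", np.round(ability, 2))
print("benchmark difficulties:", np.round(difficulty, 2))
```

Because overlapping evaluations tie everything to the same latent axis, two models that never shared a benchmark can still be compared through their fitted abilities, which is the "Rosetta Stone" intuition.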
Related papers
- AI Agents as Universal Task Solvers [94.49762121230042]
We show that the optimal speed-up that a universal solver can achieve using past data is tightly related to their algorithmic information.
We argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.
arXiv Detail & Related papers (2025-10-14T02:17:54Z)
- Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1 [0.0]
We show that better performance is not only caused by test-time algorithmic improvements or model size, but also by using impactful benchmarks as curricula for learning.
We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity's Last Exam.
arXiv Detail & Related papers (2025-08-13T20:15:20Z)
- Controlling Thinking Speed in Reasoning Models [57.14541748751654]
Human cognition operates in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking.
In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment.
Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance.
arXiv Detail & Related papers (2025-07-04T16:41:06Z)
- Measuring AI Ability to Complete Long Tasks [5.986082428339293]
We measure the time humans typically take to complete tasks that AI models can complete with 50% success rate.
Current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes.
The increase in AI models' time horizons seems to be driven by greater reliability and ability to adapt to mistakes.
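As an illustration of the 50% time-horizon idea, one could fit the model's success probability against the log of the human completion time and solve for where the fitted curve crosses 0.5. The task durations, outcomes, and logistic fit below are made-up assumptions, not the paper's actual estimation pipeline.

```python
# Minimal sketch of a "50% time horizon": regress success on log(human minutes),
# then find the task length at which predicted success drops to 0.5.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([1, 2, 5, 10, 30, 60, 120, 240, 480], dtype=float)
model_succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0])   # hypothetical outcomes

X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_succeeded)

# P(success) = 0.5 where the linear predictor is zero: log(t) = -b0 / b1.
b1 = clf.coef_[0, 0]
b0 = clf.intercept_[0]
horizon_minutes = np.exp(-b0 / b1)
print(f"estimated 50% time horizon: {horizon_minutes:.0f} minutes")
```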
arXiv Detail & Related papers (2025-03-18T17:59:31Z)
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems.
We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure.
Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? [90.30635552818875]
We present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs.
This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals.
We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets.
arXiv Detail & Related papers (2024-11-06T05:09:34Z)
- Benchmarking Neural Network Training Algorithms [52.890134877995195]
Training algorithms are an essential part of every deep learning pipeline.
As a community, we are unable to reliably identify training algorithm improvements.
We introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware.
arXiv Detail & Related papers (2023-06-12T15:21:02Z)
- Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks [2.0315147707806283]
Mystique is an accurate and scalable framework for production AI benchmark generation.
Mystique is scalable due to its lightweight data collection, which keeps both runtime overhead and instrumentation effort low.
We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models.
arXiv Detail & Related papers (2022-12-16T18:46:37Z)
- Position: Tensor Networks are a Valuable Asset for Green AI [7.066223472133622]
This position paper introduces a fundamental link between tensor networks (TNs) and Green AI.
We argue that TNs are valuable for Green AI due to their strong mathematical backbone and inherent logarithmic compression potential.
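A back-of-the-envelope reading of the "logarithmic compression potential" claim (an arithmetic sketch, not taken from the paper): a dense tensor of order N with mode size d holds d**N values, while a tensor-train factorisation with bounded rank r needs a parameter count that grows only linearly in N, i.e. logarithmically in the full tensor's size.

```python
# Parameter-count sketch; d, r, and the tensor orders are arbitrary illustrative choices.
d, r = 4, 4
for N in (4, 8, 16):
    full_params = d ** N                           # dense tensor of order N
    tt_params = 2 * d * r + (N - 2) * d * r * r    # two boundary cores + interior TT cores
    print(f"order {N:2d}: dense={full_params:>13,}  tensor-train={tt_params:>5,}")
```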
arXiv Detail & Related papers (2022-05-25T14:02:49Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
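The "length re-scaling" idea can be sketched as matching the norms of predicted novel-class weight vectors to the typical norm of the pretrained base-class weights. This is an illustrative reading of the abstract, not the paper's exact ALR procedure; all array names and sizes below are made up.

```python
# Sketch: rescale predicted novel-class weights so their vector lengths match
# the average length of the pretrained base-class weights.
import numpy as np

rng = np.random.default_rng(0)
base_weights = rng.normal(scale=1.0, size=(60, 256))    # pretrained base-class classifier weights
novel_weights = rng.normal(scale=0.3, size=(5, 256))    # predicted novel-class weights (too short)

target_norm = np.linalg.norm(base_weights, axis=1).mean()           # typical base vector length
novel_norms = np.linalg.norm(novel_weights, axis=1, keepdims=True)
rescaled = novel_weights * (target_norm / novel_norms)              # align vector lengths

print("base mean norm:   ", round(float(target_norm), 2))
print("novel norms after:", np.round(np.linalg.norm(rescaled, axis=1), 2))
```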
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- AIPerf: Automated machine learning as an AI-HPC benchmark [17.57686674304368]
We propose an end-to-end benchmark suite utilizing automated machine learning (AutoML).
We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems.
With a flexible workload and a single metric, our benchmark can scale and rank AI-HPC systems easily.
arXiv Detail & Related papers (2020-08-17T08:06:43Z)
- AIBench Training: Balanced Industry-Standard AI Training Benchmarking [26.820244556465333]
Earlier-stage evaluations of a new AI architecture/system need affordable benchmarks.
We use real-world benchmarks to cover the factor space that impacts the learning dynamics.
We contribute by far the most comprehensive AI training benchmark suite.
arXiv Detail & Related papers (2020-04-30T11:08:49Z)