The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
- URL: http://arxiv.org/abs/2511.01365v1
- Date: Mon, 03 Nov 2025 09:09:29 GMT
- Title: The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
- Authors: İbrahim Ethem Deveci, Duygu Ataman
- Abstract summary: We discuss whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities evolve over the years.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid increase in the benchmarks used to assess them. However, because model competence has improved through scaling and novel training advances, and because many of these datasets have likely been included in pre- or post-training data, results become saturated, driving a continuous need for new and more challenging replacements. In this paper, we discuss whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities across different benchmarks evolve over the years. We also analyze performance trends across different reasoning tasks and discuss the current state of benchmarking and its remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.
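The longitudinal comparison the abstract describes can be pictured with a short script. This is a minimal sketch under stated assumptions, not the paper's code: the family names match the abstract, but every date, benchmark name, and score below is a hypothetical placeholder.
```python
# Sketch: track per-family benchmark scores over release dates and report
# how much headroom remains. All records below are hypothetical placeholders.
from collections import defaultdict
from datetime import date

# (family, release_date, benchmark, score) -- illustrative values only
RESULTS = [
    ("OpenAI",    date(2023, 3, 1),  "BenchA", 0.72),
    ("OpenAI",    date(2024, 5, 1),  "BenchA", 0.91),
    ("Anthropic", date(2023, 7, 1),  "BenchA", 0.68),
    ("Anthropic", date(2024, 6, 1),  "BenchA", 0.89),
    ("Google",    date(2023, 12, 1), "BenchA", 0.70),
    ("Google",    date(2024, 12, 1), "BenchA", 0.93),
]

def trend_per_family(results, benchmark):
    """Group scores for one benchmark by family, sorted by release date."""
    series = defaultdict(list)
    for family, released, bench, score in results:
        if bench == benchmark:
            series[family].append((released, score))
    return {family: sorted(points) for family, points in series.items()}

for family, points in trend_per_family(RESULTS, "BenchA").items():
    first, last = points[0][1], points[-1][1]
    print(f"{family}: {first:.2f} -> {last:.2f} (headroom: {1.0 - last:.2f})")
```
As scores approach the ceiling, the remaining headroom shrinks toward zero, which is exactly the saturation dynamic the paper argues forces benchmark replacement.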
Related papers
- When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.66788281323414]
We analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Expert-curated benchmarks resist saturation better than crowdsourced ones. (A toy sketch of one possible saturation criterion follows this list.)
arXiv Detail & Related papers (2026-02-18T16:51:37Z)
- MACEval: A Multi-Agent Continual Evaluation Network for Large Models [52.629762680215315]
We introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models. We demonstrate that MACEval is (1) human-free and automatic, mitigating laborious result processing through guided inter-agent judgment; (2) efficient and economical, reducing the data and overhead needed to obtain results comparable to related benchmarks; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies.
arXiv Detail & Related papers (2025-11-12T09:26:24Z)
- A Survey on Large Language Model Benchmarks [45.042853171973086]
General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning. Domain-specific benchmarks focus on fields like the natural sciences, humanities and social sciences, and engineering technology. Target-specific benchmarks address risks, reliability, agents, etc.
arXiv Detail & Related papers (2025-08-21T08:43:35Z)
- Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1 [0.0]
We show that better performance is driven not only by test-time algorithmic improvements or model size but also by using impactful benchmarks as curricula for learning. We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity's Last Exam.
arXiv Detail & Related papers (2025-08-13T20:15:20Z)
- Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
This survey probes the core challenges that the rise of Large Language Models poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety. We dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
arXiv Detail & Related papers (2025-04-26T07:48:52Z)
- Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability [16.441081996257576]
We propose leveraging reasoning-intensive models to improve less computationally demanding, non-reasoning models. We demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
arXiv Detail & Related papers (2025-04-13T16:26:56Z)
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [92.89673285398521]
o1-like reasoning systems have demonstrated remarkable capabilities in solving complex reasoning tasks. We introduce an "imitate, explore, and self-improve" framework to train the reasoning model. Our approach achieves competitive performance compared to industry-level reasoning systems.
arXiv Detail & Related papers (2024-12-12T16:20:36Z)
- Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. (A generic sketch of such an LP-based assignment follows this list.)
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the ability levels of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
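For the saturation study listed above ("When AI Benchmarks Plateau"), the summary does not state how saturation is detected; the following is a minimal sketch of one plausible criterion, assuming saturation means the best reported score sits near the ceiling while year-over-year gains have collapsed. The thresholds and trajectories are illustrative assumptions, not the paper's definition.
```python
# Toy saturation check: a benchmark counts as saturated when the best
# score is within `near_ceiling` of the ceiling AND the latest
# year-over-year gain falls below `min_gain`. Thresholds are assumptions.

def is_saturated(yearly_best, ceiling=1.0, near_ceiling=0.05, min_gain=0.01):
    """yearly_best: best reported score per year, in chronological order."""
    if len(yearly_best) < 2:
        return False
    latest, previous = yearly_best[-1], yearly_best[-2]
    return (ceiling - latest) <= near_ceiling and (latest - previous) < min_gain

# Hypothetical score trajectories on two made-up benchmarks.
print(is_saturated([0.55, 0.78, 0.955, 0.96]))  # True: near ceiling, tiny gain
print(is_saturated([0.30, 0.45, 0.62, 0.74]))   # False: still climbing
```
Applied per benchmark and per year, a rule like this would also reproduce the paper's age effect: the longer a benchmark circulates, the more likely the criterion fires.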
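For the QualEval entry, the summary mentions a flexible linear programming solver but not its formulation; below is a generic sketch of one way an LP can assign evaluation examples to candidate insights, maximizing total relevance under per-insight capacity limits. The relevance matrix, capacities, and variable layout are all illustrative assumptions, not QualEval's actual program.
```python
# Generic LP sketch: assign each example to one insight, maximizing total
# relevance subject to per-insight capacities. Data is made up.
import numpy as np
from scipy.optimize import linprog

relevance = np.array([        # relevance[i][j]: example i vs. insight j
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.5, 0.6, 0.7],
    [0.1, 0.2, 0.9],
])
n_ex, n_ins = relevance.shape
capacity = [2, 2, 2]          # max examples assignable to each insight

# Variables x[i, j] in [0, 1], flattened row-major. Maximizing total
# relevance == minimizing the negated objective.
c = -relevance.flatten()

# Each example is assigned exactly once: sum_j x[i, j] == 1.
A_eq = np.zeros((n_ex, n_ex * n_ins))
for i in range(n_ex):
    A_eq[i, i * n_ins:(i + 1) * n_ins] = 1.0
b_eq = np.ones(n_ex)

# Capacity per insight: sum_i x[i, j] <= capacity[j].
A_ub = np.zeros((n_ins, n_ex * n_ins))
for j in range(n_ins):
    A_ub[j, j::n_ins] = 1.0
b_ub = np.array(capacity, dtype=float)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1), method="highs")
assignment = res.x.reshape(n_ex, n_ins).argmax(axis=1)
print("example -> insight:", assignment.tolist())
```
Because the constraint matrix of this assignment-style LP is totally unimodular, the relaxation returns integral solutions, so no separate integer solver is needed.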