Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1
- URL: http://arxiv.org/abs/2508.10173v1
- Date: Wed, 13 Aug 2025 20:15:20 GMT
- Title: Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1
- Authors: Petr Spelda, Vit Stritecky,
- Abstract summary: We show that better performance is not always caused by test-time algorithmic improvements or model sizes but also by using impactful benchmarks as curricula for learning.<n>We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity's Last Exam.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation of reasoning language models gained importance after it was observed that they can combine their existing capabilities into novel traces of intermediate steps before task completion and that the traces can sometimes help them to generalize better than past models. As reasoning becomes the next scaling dimension of large language models, careful study of their capabilities in critical tasks is needed. We show that better performance is not always caused by test-time algorithmic improvements or model sizes but also by using impactful benchmarks as curricula for learning. We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity's Last Exam. Steering development of AI by impactful benchmarks trades evaluation for learning and makes novelty of test tasks key for measuring generalization capabilities of reasoning models. Consequently, some benchmarks could be seen as curricula for training rather than unseen test sets.
Related papers
- Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces [2.0789230137053014]
Test-time scaling has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems.<n>We compare the performance of medium-sized LLMs on Math problems after post-training on two kinds of reasoning traces.
arXiv Detail & Related papers (2025-11-24T17:26:58Z) - The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation [1.2324085268373774]
We discuss whether surpassing a benchmark truly demonstrates reasoning ability or are we simply tracking numbers divorced from the capabilities we claim to measure?<n>We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities evolve over the years.
arXiv Detail & Related papers (2025-11-03T09:09:29Z) - ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models.<n>We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z) - Large Language Models Often Know When They Are Being Evaluated [0.015534429177540245]
We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment.<n>We construct a benchmark of 1,000 prompts and transcripts from 61 distinct datasets.<n>Our results indicate that frontier models already exhibit a substantial, though not yet, level of evaluation-awareness.
arXiv Detail & Related papers (2025-05-28T12:03:09Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.<n>This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.<n>We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Are Visual-Language Models Effective in Action Recognition? A Comparative Study [22.97135293252601]
This paper provides a large-scale study and insight on state-of-the-art vision foundation models.
It compares their transfer ability onto zero-shot and frame-wise action recognition tasks.
Experiments are conducted on recent fine-grained, human-centric action recognition datasets.
arXiv Detail & Related papers (2024-10-22T16:28:21Z) - Models of reference production: How do they withstand the test of time? [6.651864489482537]
We use the task of generating referring expressions in context as a case study and start our analysis from GREC.
We ask what the performance of models would be if we assessed them on more realistic datasets.
We conclude that GREC can no longer be regarded as offering a reliable assessment of models' ability to mimic human reference production.
arXiv Detail & Related papers (2023-07-27T12:46:38Z) - Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual
Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z) - Rethinking Generalization of Neural Models: A Named Entity Recognition
Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.