REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
- URL: http://arxiv.org/abs/2507.10541v2
- Date: Tue, 15 Jul 2025 06:16:53 GMT
- Title: REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
- Authors: Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu
- Abstract summary: We present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes Large Reasoning Models to multiple problems simultaneously. Our evaluation reveals several striking findings: even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing.
- Score: 33.049237516125146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and diminished difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), which forces the costly creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Several key insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming their standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code and results are available at https://opendatalab.github.io/REST.
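To make the setup concrete, here is a minimal sketch of how a REST-style stressed prompt might be assembled from several problems and the answers scored afterwards. This is an illustration under assumptions, not the authors' released code: `query_model` is a hypothetical stand-in for an LRM call, and the "Answer k:" format is an invented convention.

```python
import re

def build_rest_prompt(problems: list[str]) -> str:
    """Concatenate several problems into one REST-style stressed prompt."""
    header = (
        "Solve all of the following problems. For each one, finish with a "
        "line of the form 'Answer k: <result>'.\n\n"
    )
    body = "\n\n".join(f"Problem {i}: {p}" for i, p in enumerate(problems, 1))
    return header + body

def parse_answers(response: str, n_problems: int) -> dict[int, str]:
    """Extract 'Answer k: ...' lines; a missing answer scores as empty."""
    found = dict(re.findall(r"Answer\s*(\d+):\s*(.+)", response))
    return {k: found.get(str(k), "").strip() for k in range(1, n_problems + 1)}

# Stress level = number of problems packed into a single prompt.
problems = ["What is 17 * 24?", "Factor x^2 - 5x + 6.", "Is 91 prime?"]
prompt = build_rest_prompt(problems)
# response = query_model(prompt)   # hypothetical LRM call, not part of REST
# answers = parse_answers(response, len(problems))
```

Per-problem accuracy at increasing stress levels can then be compared against single-question accuracy to measure the degradation the paper reports.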
Related papers
- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks [25.295071827427677]
Benchmark contamination arises from the public availability of test problems. Evaluation fragility stems from the reliance on single-instance assessments. VAR-MATH is a symbolic evaluation framework designed to probe genuine reasoning ability.
arXiv Detail & Related papers (2025-07-17T08:10:55Z)
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [67.67725938962798]
Pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. We introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. We show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary.
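The clean-generation idea lends itself to a short sketch. The snippet below is a toy illustration, not the paper's actual RandomCalculation generator: freshly sampled expressions of arbitrary length cannot have appeared verbatim in any pre-training corpus.

```python
import random

def random_arithmetic(n_terms: int, max_value: int = 99) -> tuple[str, int]:
    """Sample a fresh arithmetic problem of chosen length (toy sketch).

    Newly generated expressions cannot appear verbatim in a pre-training
    corpus, so the evaluation (or reward) signal stays contamination-free.
    """
    terms = [str(random.randint(1, max_value)) for _ in range(n_terms)]
    ops = [random.choice("+-*") for _ in range(n_terms - 1)]
    expr = terms[0] + "".join(op + t for op, t in zip(ops, terms[1:]))
    # eval is safe here: the expression contains only digits and + - *
    return f"Compute: {expr}", eval(expr)

question, answer = random_arithmetic(n_terms=4)
```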
arXiv Detail & Related papers (2025-07-14T17:55:15Z)
- Dissecting Long Reasoning Models: An Empirical Study [94.31064312707211]
We systematically analyze the roles of positive and negative samples in reinforcement learning (RL). We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. We investigate unstable performance across various reasoning models and benchmarks, attributing the instability to uncertain problems with ambiguous outcomes.
arXiv Detail & Related papers (2025-06-05T11:47:10Z)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
- The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination [18.05548914181797]
Benchmark Data Contamination (BDC), the inclusion of benchmark testing samples in the training set, has raised increasing concerns in Large Language Model (LLM) evaluation. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions.
arXiv Detail & Related papers (2025-03-20T17:55:04Z)
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
- LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [7.379503137362718]
We introduce LR^2Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of Large Language Models. Our evaluation reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR^2Bench.
arXiv Detail & Related papers (2025-02-25T04:51:17Z)
- Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function. We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
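As a rough illustration of how perplexity and self-consistency can be combined, the sketch below weights a majority vote by each sampled path's geometric-mean token probability and prunes low-probability paths first. It approximates the general idea only, not the paper's actual RPC algorithm; the input format and pruning rule are assumptions.

```python
import math
from collections import defaultdict

def rpc_style_vote(paths: list[tuple[str, float]], keep_ratio: float = 0.5) -> str:
    """Confidence-weighted self-consistency with path pruning (toy sketch).

    `paths` holds (answer, avg_token_logprob) pairs from sampled reasoning
    chains; exp(avg_logprob) is the geometric-mean token probability,
    i.e. the inverse of the chain's perplexity.
    """
    scored = [(ans, math.exp(lp)) for ans, lp in paths]
    cutoff = keep_ratio * max(p for _, p in scored)
    votes: dict[str, float] = defaultdict(float)
    for ans, p in scored:
        if p >= cutoff:          # prune low-probability reasoning paths
            votes[ans] += p      # weight the vote by path confidence
    return max(votes, key=votes.get)

# Example: three sampled chains, two agreeing on "42".
print(rpc_style_vote([("42", -0.3), ("41", -1.2), ("42", -0.5)]))
```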
arXiv Detail & Related papers (2025-02-01T18:09:49Z)
- Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. Our research identifies two critical latent factors affecting RAG's confidence in its predictions. We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach to leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
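For context, the testing-by-betting idea itself admits a compact sketch: a bettor's wealth forms a nonnegative martingale under the null, so Ville's inequality licenses stopping and rejecting at any time. The snippet below is a minimal fixed-bet version, not the paper's deep, learned variant; the payoff format and bet size `lam` are assumptions.

```python
def betting_test(payoffs, alpha: float = 0.05, lam: float = 0.2):
    """Minimal testing-by-betting loop (fixed bet, not the deep variant).

    Each payoff lies in [-1, 1] and has mean zero under the null, so the
    wealth process is a nonnegative martingale; by Ville's inequality the
    chance it ever reaches 1/alpha under the null is at most alpha, which
    makes the rejection rule valid at any stopping time.
    """
    wealth = 1.0
    for t, payoff in enumerate(payoffs, 1):
        wealth *= 1.0 + lam * payoff    # bet a fixed fraction each round
        if wealth >= 1.0 / alpha:
            return t                    # reject the null at time t
    return None                         # sequence ended without rejection

# Example: a persistent positive drift triggers rejection quickly.
print(betting_test([0.3] * 100))
```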
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory of human assessment originating in the 20th century, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)