Are Large Reasoning Models Interruptible?
- URL: http://arxiv.org/abs/2510.11713v3
- Date: Thu, 16 Oct 2025 17:59:24 GMT
- Title: Are Large Reasoning Models Interruptible?
- Authors: Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
- Abstract summary: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings. We show that even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context. Our analysis further reveals several novel failure modes, including reasoning leakage, panic, and self-doubt.
- Score: 77.53059044071107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information. Project Page: http://dynamic-lm.github.io/
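To make the two evaluation scenarios concrete, below is a minimal sketch of how an interruption test and a dynamic-context test could be simulated around a generic text-generation call. This is not the paper's released harness (see the project page for the authors' setup): the `generate` callable, the prompt templates, and the token budgets are illustrative assumptions.

```python
# Hypothetical harness sketching the paper's two dynamic scenarios.
# `generate(prompt, max_tokens)` stands in for any LRM call (API or local);
# all prompt templates and budgets below are illustrative assumptions.
from typing import Callable

Generate = Callable[[str, int], str]

INTERRUPT_MSG = (
    "\n\n[INTERRUPTION] Time is up. "
    "Based on your reasoning so far, give your final answer now.\n"
)

def interruption_eval(problem: str, generate: Generate, budget: int) -> str:
    """Interruption: cut off the reasoning at a token budget, then
    force the model to commit to an answer from its partial output."""
    partial_reasoning = generate(problem, budget)  # truncated at the budget
    return generate(problem + partial_reasoning + INTERRUPT_MSG, 256)

def dynamic_context_eval(problem: str, update: str, generate: Generate,
                         inject_after: int) -> str:
    """Dynamic context: after some reasoning has been produced, inject an
    in-flight update (e.g., a changed constraint) and let the model continue."""
    partial_reasoning = generate(problem, inject_after)
    revised = (problem + partial_reasoning
               + f"\n\n[UPDATE] {update}\nContinue, accounting for the update.\n")
    return generate(revised, 2048)

if __name__ == "__main__":
    # Stub model so the sketch runs standalone; swap in a real LRM client.
    def toy_generate(prompt: str, max_tokens: int) -> str:
        return " ...thinking... answer: 42"[:max_tokens]

    print(interruption_eval("Solve: 6 * 7 = ?", toy_generate, budget=12))
    print(dynamic_context_eval("Solve: 6 * 7 = ?", "the 7 is now an 8",
                               toy_generate, inject_after=12))
```

Sweeping `budget` and `inject_after` across the length of the reasoning trace is what would surface the late-update degradation the abstract reports.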
Related papers
- SynTSBench: Rethinking Temporal Pattern Learning in Deep Learning Models for Time Series [11.314952720053464]
We propose a synthetic data-driven evaluation paradigm, SynTSBench, for time series forecasting models. Our framework isolates confounding factors and establishes an interpretable evaluation system with three core analytical dimensions. Our experiments show that current deep learning models do not universally approach optimal baselines across all types of temporal features.
arXiv Detail & Related papers (2025-10-23T06:59:38Z)
- MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning [61.04601861108966]
We propose MorphoBench, a benchmark that incorporates multidisciplinary questions to evaluate the reasoning capabilities of large models. MorphoBench adaptively modifies the analytical challenge of questions by leveraging key statements generated during the model's reasoning process. We have gathered over 1,300 test questions and iteratively adjusted the difficulty of MorphoBench based on the reasoning capabilities of models such as o3 and GPT-5.
arXiv Detail & Related papers (2025-10-16T03:30:56Z)
- Limits of Emergent Reasoning of Large Language Models in Agentic Frameworks for Deterministic Games [2.357397994148727]
Large Reasoning Models (LRMs) undergo a collapse in performance when solving puzzles beyond certain complexity thresholds. We provide a large language model (LLM) with an environment interface for Tower of Hanoi problems. We observe that access to an environment interface does not delay or eradicate performance collapse.
arXiv Detail & Related papers (2025-10-12T23:48:16Z)
- ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
- Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) can degrade performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z)
- History-Aware Neural Operator: Robust Data-Driven Constitutive Modeling of Path-Dependent Materials [8.579506050944875]
This study presents an end-to-end learning framework for data-driven modeling of inelastic materials using neural operators. We develop the History-Aware Neural Operator (HANO), an autoregressive model that predicts path-dependent material responses from short segments of recent strain-stress history. We evaluate HANO on two benchmark problems: elastoplasticity with hardening and progressive anisotropic damage in brittle solids.
arXiv Detail & Related papers (2025-06-12T05:19:17Z)
- Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning [36.470695895695044]
Self-Route is a dynamic reasoning framework that automatically selects between general and reasoning modes. We show that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55%.
arXiv Detail & Related papers (2025-05-27T03:18:31Z)
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws. We demonstrate here a dramatic breakdown of generalization and basic reasoning in all SOTA models claiming strong function. We also observe strong overconfidence in the wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations.
arXiv Detail & Related papers (2024-06-04T07:43:33Z)
- Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data.
We formalize the relevant causal structure of problems such as dynamic personalized pricing.
We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z)
- Consistent Counterfactuals for Deep Models [25.1271020453651]
Counterfactual examples are used to explain predictions of machine learning models in key areas such as finance and medical diagnosis.
This paper studies the consistency of model prediction on counterfactual examples in deep networks under small changes to initial training conditions.
arXiv Detail & Related papers (2021-10-06T23:48:55Z)
- Closed-form Continuous-Depth Models [99.40335716948101]
Continuous-depth neural models rely on advanced numerical differential equation solvers.
We present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster.
arXiv Detail & Related papers (2021-06-25T22:08:51Z)
- Anomaly Detection of Time Series with Smoothness-Inducing Sequential Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
arXiv Detail & Related papers (2021-02-02T06:15:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.