Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
- URL: http://arxiv.org/abs/2505.14810v2
- Date: Sun, 25 May 2025 14:52:51 GMT
- Title: Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
- Authors: Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng,
- Abstract summary: We introduce MathIF, a benchmark for evaluating instruction-following in mathematical reasoning tasks.<n>Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability.<n>We show that even simple interventions can partially recover obedience, though at the cost of reasoning performance.
- Score: 27.142703756752997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.
Related papers
- Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance.<n>We identify five distinct failure modes when models reason for longer.<n>These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z) - Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills [32.96074934023323]
Large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation.<n>We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs.<n>We propose Reasoning-aware Representation Misdirection for Unlearning ($R2MU$), a novel method that effectively suppresses sensitive reasoning traces.
arXiv Detail & Related papers (2025-06-15T20:54:23Z) - When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs [16.659986373052217]
Chain-of-thought reasoning can significantly degrade instruction-following accuracy.<n>This is the first work to systematically expose reasoning-induced failures in instruction-following.
arXiv Detail & Related papers (2025-05-16T16:36:00Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction [34.86855316803838]
We identify a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall.<n>We show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation.
arXiv Detail & Related papers (2025-03-29T14:00:44Z) - Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [54.04678363287392]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks.<n>Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains.
arXiv Detail & Related papers (2025-03-20T17:59:38Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z) - Improving Long-Horizon Imitation Through Instruction Prediction [93.47416552953075]
In this work, we explore the use of an often unused source of auxiliary supervision: language.
Inspired by recent advances in transformer-based models, we train agents with an instruction prediction loss that encourages learning temporally extended representations that operate at a high level of abstraction.
In further analysis we find that instruction modeling is most important for tasks that require complex reasoning, while understandably offering smaller gains in environments that require simple plans.
arXiv Detail & Related papers (2023-06-21T20:47:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.