When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
- URL: http://arxiv.org/abs/2509.22193v1
- Date: Fri, 26 Sep 2025 10:53:52 GMT
- Title: When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
- Authors: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo,
- Abstract summary: We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems.
- Score: 12.583725308641633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.
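For readers who want a concrete picture of the distillation setup described above, the sketch below shows one way to build paired supervision for the two student types: answer-only targets for IFT and trace-plus-answer targets for reasoning. This is a minimal illustration under assumed interfaces, not the authors' pipeline; `teacher_generate` is a hypothetical stand-in for any teacher-model call.

```python
# Minimal sketch of building paired supervision for IFT vs. reasoning students.
# `teacher_generate` is hypothetical: any call that returns a reasoning trace
# and a final answer for a prompt (e.g., an API call or a local model).

def teacher_generate(prompt: str) -> tuple[str, str]:
    raise NotImplementedError  # stand-in for a real teacher-model call

def build_targets(prompts: list[str]) -> tuple[list[dict], list[dict]]:
    ift_data, reasoning_data = [], []
    for p in prompts:
        trace, answer = teacher_generate(p)
        # IFT student: supervise on the final answer only.
        ift_data.append({"prompt": p, "target": answer})
        # Reasoning student: supervise on the trace followed by the answer.
        reasoning_data.append({"prompt": p, "target": f"{trace}\n{answer}"})
    return ift_data, reasoning_data
```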
Related papers
- Understanding the Implicit Biases of Design Choices for Time Series Foundation Models [90.894232610821]
Time series foundation models (TSFMs) are a class of potentially powerful, general-purpose tools for time series forecasting and related temporal tasks. Their behavior is strongly shaped by subtle inductive biases in their design. We show how these biases can be intuitive or very counterintuitive, depending on properties of the model and data.
arXiv Detail & Related papers (2025-10-22T04:42:35Z)
- The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging [8.930191971732649]
We present a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles.
arXiv Detail & Related papers (2025-09-26T08:12:13Z)
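A common baseline for the merging techniques surveyed above is linear interpolation between an instruction-tuned and a reasoning-tuned checkpoint of the same architecture; sweeping the coefficient traces out the accuracy/token-efficiency trade-off. A minimal sketch (the paper evaluates a range of merging methods, which may differ from this one):

```python
import torch

def merge_linear(state_a: dict, state_b: dict, alpha: float) -> dict:
    """Linearly interpolate two same-architecture checkpoints.

    alpha = 0.0 returns model A; alpha = 1.0 returns model B.
    Both state dicts must share identical keys and tensor shapes.
    """
    return {
        name: (1.0 - alpha) * state_a[name] + alpha * state_b[name]
        for name in state_a
    }

# Usage sketch: sweep alpha to tune how much "reasoning" the merge inherits.
# merged = merge_linear(instruct.state_dict(), reasoner.state_dict(), 0.3)
# model.load_state_dict(merged)
```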
- NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks [65.70224757972068]
We select reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning. We find that simply scaling up data size with random sampling is a strong baseline with steady performance gains. We find that selecting difficult examples that require more diverse reasoning strategies is a more sample-efficient way to transfer the teacher model's reasoning skills.
arXiv Detail & Related papers (2025-07-02T17:30:24Z)
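The selection idea above, preferring difficult questions that elicit diverse reasoning strategies, can be approximated by ranking candidates on simple proxies. A hypothetical sketch; the scoring fields `n_strategies` and `teacher_fail_rate` are illustrative assumptions, not the paper's actual features:

```python
def select_traces(pool: list[dict], k: int) -> list[dict]:
    # Rank candidate (question, trace) pairs by a difficulty/diversity proxy:
    # more distinct strategies and a higher teacher failure rate rank higher.
    def score(ex: dict) -> float:
        return ex["n_strategies"] + ex["teacher_fail_rate"]
    return sorted(pool, key=score, reverse=True)[:k]
```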
- Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models [3.102576158218633]
Reasoning-based language models have demonstrated strong performance across various domains. Recent research has shown that reasoning also offers significant benefits for safety and guardrail applications. Our study focuses on two key dimensions: data efficiency and inference efficiency.
arXiv Detail & Related papers (2025-05-26T15:01:37Z)
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models [48.98109982725689]
We conduct the first systematic study on quantized reasoning models. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths. We identify model size, model origin, and task difficulty as critical determinants of performance.
arXiv Detail & Related papers (2025-04-07T08:22:45Z)
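For reference, the simplest form of weight quantization covered by such studies is round-to-nearest symmetric quantization; a minimal numpy sketch is below. The paper itself evaluates state-of-the-art algorithms across weights, KV cache, and activations, which go well beyond this baseline.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 8):
    """Round-to-nearest symmetric quantization (bits <= 8 for int8 storage)."""
    qmax = 2 ** (bits - 1) - 1              # e.g., 127 for int8
    scale = np.abs(w).max() / qmax          # one scale per tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric(w, bits=8)
print(np.abs(w - dequantize(q, s)).max())   # max quantization error
```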
- Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead [33.011660907969706]
Inference-time scaling can enhance the reasoning capabilities of large language models. We investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks.
arXiv Detail & Related papers (2025-03-31T23:40:28Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [108.07030347318624]
We show that scaling with longer Chains of Thought (CoTs) can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains. We propose a Thinking-Optimal Scaling strategy to teach models to adopt different reasoning efforts for deep thinking. Our self-improvement models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- Training Language Models to Reason Efficiently [14.390800014819439]
We use reinforcement learning to train large reasoning models to reason efficiently. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
arXiv Detail & Related papers (2025-02-06T19:18:16Z)
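One standard way to express this incentive is a length-penalized reward: full credit for a correct answer minus a small per-token cost. The shaping below is a generic sketch, not necessarily the paper's exact formulation:

```python
def efficiency_reward(correct: bool, n_tokens: int, lam: float = 1e-4) -> float:
    # Reward correctness, but charge for every generated token so the
    # policy learns to stop reasoning once the answer is secured.
    return (1.0 if correct else 0.0) - lam * n_tokens
```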
- Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
The Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality. We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
arXiv Detail & Related papers (2024-11-30T10:56:30Z)
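Approximate Entropy itself has a standard definition: ApEn(m, r) = Φ^m(r) − Φ^{m+1}(r), where Φ^m(r) is the average log-fraction of length-m windows lying within Chebyshev distance r of each window. The numpy sketch below implements that textbook definition; how the paper plugs ApEn into its Performance Law is not reproduced here.

```python
import numpy as np

def apen(u, m: int = 2, r=None) -> float:
    """Approximate Entropy ApEn(m, r) of a 1-D series (textbook definition)."""
    u = np.asarray(u, dtype=float)
    if r is None:
        r = 0.2 * u.std()  # common default tolerance

    def phi(m: int) -> float:
        n = len(u) - m + 1
        # All length-m windows, compared under the Chebyshev (max) metric.
        x = np.lib.stride_tricks.sliding_window_view(u, m)
        dist = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
        c = (dist <= r).sum(axis=1) / n   # self-match included, so c > 0
        return np.log(c).mean()

    return phi(m) - phi(m + 1)

print(apen(np.sin(np.linspace(0, 20, 300))))  # low ApEn: regular signal
```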
- Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models [34.79589443380606]
The scaling of large language models (LLMs) is a critical research area for the efficiency and effectiveness of model training and deployment. Our work investigates the transferability and discrepancies of scaling laws between dense and MoE models.
arXiv Detail & Related papers (2024-10-08T03:21:56Z)
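Scaling-law comparisons of this kind typically fit a saturating power law, L(N) = a·N^(−b) + c, to loss versus parameter count for each model family and compare the fitted constants. A generic fitting sketch with scipy; the data points are placeholders, not the paper's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # L(N) = a * N^(-b) + c : loss falls as a power of parameter count N.
    return a * n ** (-b) + c

# Placeholder (N, loss) points for one model family -- not real measurements.
n_params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
losses = np.array([3.10, 2.85, 2.62, 2.45, 2.31])

(a, b, c), _ = curve_fit(power_law, n_params, losses, p0=[10.0, 0.1, 1.5])
print(f"fitted exponent b = {b:.3f}")  # compare b across dense vs. MoE fits
```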
- Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
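DPO optimizes a closed-form preference loss over chosen/rejected pairs: −log σ(β[(log π_θ(y_w) − log π_ref(y_w)) − (log π_θ(y_l) − log π_ref(y_l))]). The torch sketch below implements that standard loss on summed sequence log-probabilities; the paper's trajectory collection and process-reward synthesis are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed token log-probs for whole sequences:
    policy and reference model, on preferred (w) and rejected (l) trajectories.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Usage sketch: logp_* come from summing token log-probs of each trajectory.
# loss = dpo_loss(logp_w, logp_l, ref_w, ref_l); loss.backward()
```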
- Feeding What You Need by Understanding What You Learned [54.400455868448695]
Machine Reading Comprehension (MRC) reveals the ability to understand a given text passage and answer questions based on it. Existing research in MRC relies heavily on large models and corpora to improve performance as measured by metrics such as Exact Match. We argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data.
arXiv Detail & Related papers (2022-03-05T14:15:59Z)