Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- URL: http://arxiv.org/abs/2507.00432v1
- Date: Tue, 01 Jul 2025 05:23:05 GMT
- Title: Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
- Abstract summary: We evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks. We find that most models that succeed in math fail to transfer their gains to other domains. Our results suggest a need to rethink standard post-training recipes.
- Score: 46.22610560773869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
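As a concrete illustration of the token-space distribution-shift analysis mentioned in the abstract, the sketch below compares a base model's next-token distributions with a fine-tuned model's on general-domain text and reports the mean per-token KL divergence. The checkpoint names, the KL direction, and the evaluation prompt are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a token-space distribution-shift measurement, assuming the
# tuned checkpoint shares the base model's tokenizer. Names are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-14B"                 # assumed base checkpoint
TUNED = "your-org/Qwen3-14B-math-sft"   # hypothetical math-tuned checkpoint

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16)

@torch.no_grad()
def mean_token_kl(text: str) -> float:
    """Mean per-token KL(tuned || base) over a general-domain prompt."""
    ids = tok(text, return_tensors="pt")
    logp_base = F.log_softmax(base(**ids).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(**ids).logits, dim=-1)
    kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)
    return kl.mean().item()  # larger values indicate more output drift

print(mean_token_kl("Summarize the main causes of the French Revolution."))
```

Averaging this quantity over a general-domain corpus gives one drift score per fine-tuned model, which is roughly the kind of comparison between SFT- and RL-tuned checkpoints the abstract alludes to.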
Related papers
- Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them [25.324955028065887]
Two popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT). We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU. SFT exhibits greater updates and also affects mid-layer queries more, leading us to hypothesise that this may have caused the out-of-domain degradation.
arXiv Detail & Related papers (2025-07-13T19:04:17Z) - Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective [82.24301452333577]
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains.
arXiv Detail & Related papers (2025-06-17T20:24:00Z) - Incentivizing LLMs to Self-Verify Their Answers [20.2584779107763]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks. We propose a framework that incentivizes LLMs to self-verify their own answers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B.
arXiv Detail & Related papers (2025-06-02T06:54:29Z) - Maximizing Confidence Alone Improves Reasoning [48.83927980325788]
RENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised RL method that requires no external reward or ground-truth answers. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability.
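As a rough illustration of the confidence signal this entry describes, the sketch below scores a sampled completion by the negative mean entropy of the model's next-token distributions over the answer tokens. How RENT feeds this reward into the RL update is not reproduced here, and the model name and token-slicing convention are assumptions.

```python
# Sketch of an entropy-based (confidence) reward in the spirit of RENT: lower
# entropy over the generated answer tokens means higher confidence and hence
# higher reward. The RL loop that consumes this reward is omitted.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-7B"  # one of the base models named in the abstract
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

@torch.no_grad()
def confidence_reward(prompt: str, completion: str) -> float:
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0]              # (seq_len, vocab)
    answer_logits = logits[prompt_len - 1 : -1]     # positions predicting answer tokens
    probs = F.softmax(answer_logits.float(), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return -entropy.mean().item()                   # reward = negative mean entropy
```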
arXiv Detail & Related papers (2025-05-28T17:59:37Z) - Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models [27.142703756752997]
We introduce MathIF, a benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability. We show that even simple interventions can partially recover obedience, though at the cost of reasoning performance.
arXiv Detail & Related papers (2025-05-20T18:18:01Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Distributional Successor Features Enable Zero-Shot Policy Optimization [36.53356539916603]
This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs). DiSPOs learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions.
arXiv Detail & Related papers (2024-03-10T22:27:21Z) - ReFT: Reasoning with Reinforced Fine-Tuning [9.80361828538909]
We propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of LLMs for reasoning. ReFT first warms up the model with SFT and then employs online reinforcement learning, specifically the PPO algorithm in this paper. Experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT.
arXiv Detail & Related papers (2024-01-17T04:43:21Z) - Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, which significantly outperforms the supervised fine-tuning (SFT) accuracy of 35.9% (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
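The rejection-sampling recipe summarized in the entry above boils down to: sample several solutions per training problem, keep those whose final answer matches the reference, and fine-tune on the kept solutions. Below is a rough sketch of the filtering step only; the answer-extraction heuristic and the data format are assumptions.

```python
# Rough sketch of a rejection-sampling filter for GSM8K-style data: keep only
# sampled solutions whose final number matches the reference answer, then use
# the kept (prompt, completion) pairs as SFT data. Heuristics here are assumed.
import re

def extract_final_number(text: str) -> str | None:
    # GSM8K references end with "#### <answer>"; sampled solutions are assumed
    # to end with a final number as well.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def rejection_filter(problem: str, reference: str, samples: list[str]) -> list[dict]:
    gold = extract_final_number(reference)
    kept = []
    for solution in samples:
        if gold is not None and extract_final_number(solution) == gold:
            kept.append({"prompt": problem, "completion": solution})
    return kept  # fine-tune on the union of kept pairs across the training set
```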