Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- URL: http://arxiv.org/abs/2507.00432v1
- Date: Tue, 01 Jul 2025 05:23:05 GMT
- Title: Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
- Abstract summary: We evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks. We find that most models that succeed in math fail to transfer their gains to other domains. Our results suggest a need to rethink standard post-training recipes.
- Score: 46.22610560773869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
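As a concrete illustration of the token-space distribution-shift analysis mentioned in the abstract, the sketch below compares a base model's next-token distributions with a fine-tuned model's on general-domain text and reports the mean per-token KL divergence. The checkpoint names, the KL direction, and the evaluation prompt are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a token-space distribution-shift measurement, assuming the
# tuned checkpoint shares the base model's tokenizer. Names are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-14B"                 # assumed base checkpoint
TUNED = "your-org/Qwen3-14B-math-sft"   # hypothetical math-tuned checkpoint

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16)

@torch.no_grad()
def mean_token_kl(text: str) -> float:
    """Mean per-token KL(tuned || base) over a general-domain prompt."""
    ids = tok(text, return_tensors="pt")
    logp_base = F.log_softmax(base(**ids).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(**ids).logits, dim=-1)
    kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)
    return kl.mean().item()  # larger values indicate more output drift

print(mean_token_kl("Summarize the main causes of the French Revolution."))
```

Averaging this quantity over a general-domain corpus gives one drift score per fine-tuned model, which is roughly the kind of comparison between SFT- and RL-tuned checkpoints the abstract alludes to.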
Related papers
- Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them [25.324955028065887]
Two popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT). We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU. SFT exhibits greater updates and also affects mid-layer queries more, leading us to hypothesise that this may have caused the out-of-domain degradation.
arXiv Detail & Related papers (2025-07-13T19:04:17Z) - Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective [82.24301452333577]
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains.
arXiv Detail & Related papers (2025-06-17T20:24:00Z) - Incentivizing LLMs to Self-Verify Their Answers [20.2584779107763]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks. We propose a framework that incentivizes LLMs to self-verify their own answers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B.
arXiv Detail & Related papers (2025-06-02T06:54:29Z) - Maximizing Confidence Alone Improves Reasoning [48.83927980325788]
RENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised RL method that requires no external reward or ground-truth answers. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability.
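As a rough illustration of the confidence signal this entry describes, the sketch below scores a sampled completion by the negative mean entropy of the model's next-token distributions over the answer tokens. How RENT feeds this reward into the RL update is not reproduced here, and the model name and token-slicing convention are assumptions.

```python
# Sketch of an entropy-based (confidence) reward in the spirit of RENT: lower
# entropy over the generated answer tokens means higher confidence and hence
# higher reward. The RL loop that consumes this reward is omitted.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-7B"  # one of the base models named in the abstract
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

@torch.no_grad()
def confidence_reward(prompt: str, completion: str) -> float:
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0]              # (seq_len, vocab)
    answer_logits = logits[prompt_len - 1 : -1]     # positions predicting answer tokens
    probs = F.softmax(answer_logits.float(), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return -entropy.mean().item()                   # reward = negative mean entropy
```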
arXiv Detail & Related papers (2025-05-28T17:59:37Z) - Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models [27.142703756752997]
We introduce MathIF, a benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability. We show that even simple interventions can partially recover obedience, though at the cost of reasoning performance.
arXiv Detail & Related papers (2025-05-20T18:18:01Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Distributional Successor Features Enable Zero-Shot Policy Optimization [36.53356539916603]
This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs). DiSPOs learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions.
arXiv Detail & Related papers (2024-03-10T22:27:21Z) - ReFT: Reasoning with Reinforced Fine-Tuning [9.80361828538909]
We propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of LLMs for reasoning. ReFT first warms up the model with SFT and then employs online reinforcement learning, specifically the PPO algorithm in this paper. Experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT.
arXiv Detail & Related papers (2024-01-17T04:43:21Z) - Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, which significantly outperforms the supervised fine-tuning (SFT) accuracy of 35.9% (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
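The rejection-sampling recipe summarized in the entry above boils down to: sample several solutions per training problem, keep those whose final answer matches the reference, and fine-tune on the kept solutions. Below is a rough sketch of the filtering step only; the answer-extraction heuristic and the data format are assumptions.

```python
# Rough sketch of a rejection-sampling filter for GSM8K-style data: keep only
# sampled solutions whose final number matches the reference answer, then use
# the kept (prompt, completion) pairs as SFT data. Heuristics here are assumed.
import re

def extract_final_number(text: str) -> str | None:
    # GSM8K references end with "#### <answer>"; sampled solutions are assumed
    # to end with a final number as well.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def rejection_filter(problem: str, reference: str, samples: list[str]) -> list[dict]:
    gold = extract_final_number(reference)
    kept = []
    for solution in samples:
        if gold is not None and extract_final_number(solution) == gold:
            kept.append({"prompt": problem, "completion": solution})
    return kept  # fine-tune on the union of kept pairs across the training set
```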