Related papers: On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

URL: http://arxiv.org/abs/2601.07389v1
Date: Mon, 12 Jan 2026 10:14:09 GMT
Title: On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
Authors: Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang,
Abstract summary: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL)<n>We show that RL increases SFT loss under SFT optimality and that SFT lowers the reward achieved by RL.<n> Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training.
Score: 10.433802085981046
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under SFT optimality and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training

Related papers

Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training [61.1421888242439]
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL)<n>We propose a framework to bridge this chasm by enabling On-Policy SFT.
arXiv Detail & Related papers (2026-02-12T17:59:58Z)
SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning [54.393763477932474]
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs)<n>We propose SED-SFT, which adaptively encourages diversity based on the token exploration space.<n>This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
arXiv Detail & Related papers (2026-02-07T09:39:21Z)
Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning [8.550698116833123]
Post-training of reasoning LLMs typically consists of an offline SFT stage followed by an online reinforcement learning stage.<n>We show that, after identical RL training, models from stronger SFT checkpoints can significantly underperform those from weaker ones.<n>We propose PEAR, an SFT-stage method that corrects this mismatch and better prepares the model for RL.
arXiv Detail & Related papers (2026-02-01T06:53:45Z)
Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities.<n>We introduce TRAPO, a framework that interleavesSupervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance.<n>Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z)
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning [49.22815446849924]
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning.<n>For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled.<n>We propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions"
arXiv Detail & Related papers (2025-10-29T22:05:08Z)
Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners [28.039145840787683]
Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting.<n>Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting.<n>We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT.
arXiv Detail & Related papers (2025-10-06T03:01:14Z)
Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead [20.446287312285648]
We study whether high SFT scores translate to improved performance after RL.<n>We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness.<n>We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance to provide strong proxies for the RL outcome.
arXiv Detail & Related papers (2025-10-02T02:57:00Z)
RL Fine-Tuning Heals OOD Forgetting in SFT [35.01074051556079]
We investigate the evolution and mechanism behind the synergy ofSupervised Fine-Tuning and Reinforcement Learning.<n>Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism.
arXiv Detail & Related papers (2025-09-08T21:40:41Z)
Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning [36.06085913761571]
This study introduces a novel method for learning reasoning models that employs bilevel optimization.<n>By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process.
arXiv Detail & Related papers (2025-09-08T17:58:02Z)
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [66.17068546293487]
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning.<n>We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks.<n>We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
arXiv Detail & Related papers (2025-07-10T09:05:49Z)
AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy [48.30596996677882]
We investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models.<n> scaling strategies yield notable improvements in reasoning performance.<n>Our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and new state-of-the-art performance among Qwen2.5-7B-based reasoning models.
arXiv Detail & Related papers (2025-06-16T09:27:48Z)
On the Power of Perturbation under Sampling in Solving Extensive-Form Games [56.013335390600524]
We investigate how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in solving extensive-form games under sampling.<n>We present a unified framework for textitPerturbed FTRL algorithms and study two variants: PFTRL-KL and PFTRL-RKL.
arXiv Detail & Related papers (2025-01-28T00:29:38Z)
More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL. Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.