On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- URL: http://arxiv.org/abs/2508.05629v2
- Date: Thu, 16 Oct 2025 13:40:55 GMT
- Title: On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
- Abstract summary: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token.
- Score: 61.607788999847564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
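The rescaling described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes the rescaling factor (the target-token probability) is treated as a constant when differentiating, i.e. a stop-gradient. Under that assumption, the per-token gradient with respect to the logits is the usual SFT gradient multiplied by the probability of the target token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_loss(logits, target):
    """Standard SFT: negative log-likelihood of the target token."""
    p = softmax(logits)[target]
    return -math.log(p)


def dft_token_loss(logits, target):
    """DFT-style loss value: the NLL rescaled by the target-token probability."""
    p = softmax(logits)[target]
    return -p * math.log(p)


def sft_logit_grad(logits, target):
    """Gradient of the SFT loss w.r.t. the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [q - (1.0 if i == target else 0.0) for i, q in enumerate(probs)]


def dft_logit_grad(logits, target):
    """Gradient of the rescaled loss, ASSUMING the probability factor is
    held constant (stop-gradient), so it only scales the SFT gradient."""
    weight = softmax(logits)[target]
    return [weight * g for g in sft_logit_grad(logits, target)]


if __name__ == "__main__":
    logits = [2.0, 0.0, -2.0]
    target = 2  # a token the model currently considers unlikely
    print("SFT grad:", sft_logit_grad(logits, target))
    print("DFT grad:", dft_logit_grad(logits, target))
```

For a low-probability target token the SFT gradient is large, while the rescaled gradient is damped by that same small probability, which is one way to read the abstract's claim about stabilized per-token updates. In an autograd framework the same effect would typically be obtained by multiplying the per-token loss by a detached copy of the token probability.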
Related papers
- CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning [67.78566256784404]
While Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting. Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. We propose a Continual GUI Learning framework that balances efficiency and skill retention.
arXiv Detail & Related papers (2026-03-03T13:02:20Z)
- Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training [61.1421888242439]
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). We propose a framework to bridge this chasm by enabling On-Policy SFT.
arXiv Detail & Related papers (2026-02-12T17:59:58Z)
- SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning [54.393763477932474]
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). We propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
arXiv Detail & Related papers (2026-02-07T09:39:21Z)
- Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning [18.934789236342244]
Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. Standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations.
arXiv Detail & Related papers (2025-10-13T03:25:36Z)
- Beyond Imitation: Recovering Dense Rewards from Demonstrations [64.05543657441218]
Supervised fine-tuning is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on datasets. We prove that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. Dense-Path REINFORCE consistently outperforms the original SFT models on instruction-following benchmarks.
arXiv Detail & Related papers (2025-10-02T18:58:26Z)
- Debunk the Myth of SFT Generalization [13.700645417996412]
A prevailing view holds that supervised fine-tuning (SFT) fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We show that much of SFT's perceived failure stems from frozen-prompt artifacts. We ask whether SFT can generalize to strictly harder tasks.
arXiv Detail & Related papers (2025-09-30T20:01:09Z)
- Anchored Supervised Fine-Tuning [26.17356786243252]
Post-training of large language models involves a trade-off between supervised fine-tuning and reinforcement learning. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities. We propose Anchored Supervised Fine-Tuning (ASFT) to augment DFT's reweighting with lightweight KL regularization to preserve tightness while ensuring stability.
arXiv Detail & Related papers (2025-09-28T08:58:12Z)
- AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance [7.685078284407324]
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). Recent single-stage methods attempt to unify SFT and RL using principleds, but lack a mechanism for dynamically balancing the two paradigms. We introduce Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward.
arXiv Detail & Related papers (2025-08-09T11:40:54Z)
- Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) [5.000768714035796]
We draw on a connection between supervised fine-tuning (SFT) and the theory and practice of finding optimal policies via Reinforcement Learning (RL). We show that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT).
arXiv Detail & Related papers (2025-07-17T07:26:54Z)
- Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling [35.64557242726578]
Prefix-RFT is a hybrid approach that synergizes learning from both demonstration and exploration. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods.
arXiv Detail & Related papers (2025-07-02T13:04:09Z)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- UFT: Unifying Supervised and Reinforcement Fine-Tuning [21.195897792629548]
We propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals. We theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck.
arXiv Detail & Related papers (2025-05-22T17:53:57Z)
- Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants -- Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
- Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data [73.04828796123581]
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs). We introduce Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness.
arXiv Detail & Related papers (2025-02-25T22:38:55Z)
- ReFT: Reasoning with Reinforced Fine-Tuning [9.80361828538909]
We propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning. ReFT first warms up the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this paper. Experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT.
arXiv Detail & Related papers (2024-01-17T04:43:21Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNN) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
- Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis.
We show that it improves the sample efficiency of both state-based and image-based RL.
arXiv Detail & Related papers (2021-12-06T18:59:52Z)