On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- URL: http://arxiv.org/abs/2508.05629v1
- Date: Thu, 07 Aug 2025 17:59:04 GMT
- Title: On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
- Abstract summary: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token.
- Score: 50.30835290642069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
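The abstract describes the change as a one-line rescaling of the SFT objective by the probability of the current target token. Below is a minimal PyTorch sketch of that idea, assuming the weight is the detached per-token probability under the model being trained; the function name `dft_style_loss` and the reduction choice are illustrative assumptions, and the authoritative implementation is the linked repository.

```python
import torch
import torch.nn.functional as F

def dft_style_loss(logits, labels, ignore_index=-100):
    """Per-token cross-entropy rescaled by the (detached) target-token probability.

    A sketch of the 'rescale the SFT objective by the token probability' idea
    from the abstract; see the official DFT repository for the exact form.
    logits: (batch, seq_len, vocab), labels: (batch, seq_len)
    """
    vocab = logits.size(-1)
    # Standard SFT term: -log p_theta(y_t | y_<t, x), one value per token.
    ce = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        ignore_index=ignore_index, reduction="none",
    )
    mask = (labels.view(-1) != ignore_index).float()
    # p_theta(y_t) recovered from the cross-entropy; detached so it acts only
    # as a per-token weight and does not contribute to the gradient path.
    token_prob = torch.exp(-ce).detach()
    return (token_prob * ce * mask).sum() / mask.sum().clamp(min=1.0)
```

Relative to plain token-level cross-entropy, the only difference is the `token_prob` factor, which is consistent with the abstract's claim that the method amounts to a single-line change to standard SFT.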
Related papers
- Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) [5.000768714035796]
We draw on a connection between supervised fine-tuning (SFT) and the theory and practice of finding optimal policies via Reinforcement Learning (RL). We show that a small modification to SFT leads to an importance-weighted variant that behaves closer to training with RL. We refer to this variant as importance-weighted supervised fine-tuning (iw-SFT).
arXiv Detail & Related papers (2025-07-17T07:26:54Z) - Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling [35.64557242726578]
Prefix-RFT is a hybrid approach that synergizes learning from both demonstration and exploration. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods.
arXiv Detail & Related papers (2025-07-02T13:04:09Z) - Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z) - UFT: Unifying Supervised and Reinforcement Fine-Tuning [21.195897792629548]
We propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals. We theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck.
arXiv Detail & Related papers (2025-05-22T17:53:57Z) - Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data [73.04828796123581]
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs). We introduce Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness.
arXiv Detail & Related papers (2025-02-25T22:38:55Z) - ReFT: Reasoning with Reinforced Fine-Tuning [9.80361828538909]
We propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of LLMs for reasoning. ReFT first warms up the model with SFT, and then employs online reinforcement learning, specifically the PPO algorithm in this paper. Experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT.
arXiv Detail & Related papers (2024-01-17T04:43:21Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z) - Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis.
We show that it improves the sample efficiency of both state-based and image-based RL.
arXiv Detail & Related papers (2021-12-06T18:59:52Z)