Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
- URL: http://arxiv.org/abs/2507.00018v2
- Date: Fri, 04 Jul 2025 08:16:16 GMT
- Title: Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
- Authors: Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu,
- Abstract summary: We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
- Score: 65.36449542323277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to 25% relative gain and a 6% absolute win-rate increase in instruction-following tasks). Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.
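The minimal PyTorch sketch below illustrates the implicit-reward view described in the abstract: DPO's implicit reward r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) sits next to plain SFT negative log-likelihood on the chosen response, run at a reduced learning rate. The toy linear "policy", the beta value, and the 0.1x learning-rate factor are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of the implicit-reward view: DPO's reward is
# r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)), and SFT can be read as
# implicit reward learning in the same policy-reward subspace.
# Toy model, beta, and the 10x learning-rate reduction are assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, T = 32, 6                         # toy vocabulary size and sequence length
policy = torch.nn.Linear(T, V * T)   # stand-in "policy" producing logits
ref = torch.nn.Linear(T, V * T)      # frozen reference policy
for p in ref.parameters():
    p.requires_grad_(False)

def seq_logprob(model, x, y):
    """Sum of per-token log-probs log pi(y|x) for a single sequence."""
    logits = model(x).view(T, V)
    return F.log_softmax(logits, dim=-1)[torch.arange(T), y].sum()

x = torch.randn(T)
y_w = torch.randint(V, (T,))         # preferred (chosen) response tokens
y_l = torch.randint(V, (T,))         # dispreferred (rejected) response tokens

beta = 0.1
# Implicit rewards under the DPO parameterization.
r_w = beta * (seq_logprob(policy, x, y_w) - seq_logprob(ref, x, y_w))
r_l = beta * (seq_logprob(policy, x, y_l) - seq_logprob(ref, x, y_l))
dpo_loss = -F.logsigmoid(r_w - r_l)

# SFT on the chosen response: plain negative log-likelihood. Since the KL
# term is constant w.r.t. the policy here, the abstract's remedy is simply a
# smaller learning rate; the 0.1x factor is an illustrative choice.
sft_loss = -seq_logprob(policy, x, y_w)
base_lr = 1e-2
opt = torch.optim.SGD(policy.parameters(), lr=base_lr * 0.1)
opt.zero_grad()
sft_loss.backward()
opt.step()
print(f"dpo_loss={dpo_loss.item():.4f}  sft_loss={sft_loss.item():.4f}")
```

Because the SFT objective here carries no policy-dependent KL term, the only brake on how far an update moves the policy is the step size, which is why a reduced learning rate acts as the regularizer in this reading.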
Related papers
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [50.30835290642069]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token.
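A short hedged sketch of the rescaling described above: weight each token's negative log-likelihood by that token's probability under the current model, treated as a stop-gradient. The detached weight is an assumption made for illustration; the DFT paper defines the exact objective.

```python
# Hedged sketch: rescale per-token cross-entropy by the (detached) token
# probability under the current model.
import torch
import torch.nn.functional as F

def dft_style_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (T, V) token logits; targets: (T,) gold token ids."""
    logp = F.log_softmax(logits, dim=-1)                      # (T, V)
    tok_logp = logp[torch.arange(targets.numel()), targets]   # (T,)
    weight = tok_logp.exp().detach()                          # pi(y_t), no gradient
    return -(weight * tok_logp).mean()

logits = torch.randn(8, 50, requires_grad=True)
targets = torch.randint(50, (8,))
dft_style_loss(logits, targets).backward()
```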
arXiv Detail & Related papers (2025-08-07T17:59:04Z) - Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling [35.64557242726578]
Prefix-RFT is a hybrid approach that synergizes learning from both demonstration and exploration. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods.
arXiv Detail & Related papers (2025-07-02T13:04:09Z) - Optimization-Inspired Few-Shot Adaptation for Large Language Models [25.439708260502556]
Large Language Models (LLMs) have demonstrated remarkable performance in real-world applications. Adapting LLMs to novel tasks via fine-tuning often requires substantial training data and computational resources that are impractical in few-shot scenarios. Existing approaches, such as in-context learning and Parameter-Efficient Fine-Tuning (PEFT), face key limitations.
arXiv Detail & Related papers (2025-05-25T11:54:23Z) - Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models [12.500777267361102]
We introduce a novel preference-oriented supervised fine-tuning approach, namely PoFT. The intuition is to boost SFT by imposing a particular preference: favoring the target model over aligned LLMs on the same SFT data. PoFT achieves stable and consistent improvements over the SFT baselines across different training datasets and base models.
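One plausible instantiation of "favoring the target model over aligned LLMs on the same SFT data" is a Bradley-Terry-style term that pushes the target model's log-likelihood of the demonstration above that of a frozen aligned model. The sketch below is a hedged reading, not PoFT's exact loss.

```python
# Hedged sketch: prefer the target model's sequence likelihood over a frozen
# aligned model's likelihood on the same SFT response.
import torch
import torch.nn.functional as F

def poft_style_loss(target_logp: torch.Tensor, aligned_logp: torch.Tensor) -> torch.Tensor:
    """Both inputs are sequence log-likelihoods log pi(y|x) of the SFT response."""
    return -F.logsigmoid(target_logp - aligned_logp.detach()).mean()

target_logp = torch.tensor([-42.0, -37.5], requires_grad=True)   # from the model being tuned
aligned_logp = torch.tensor([-40.0, -39.0])                      # from a frozen aligned LLM
poft_style_loss(target_logp, aligned_logp).backward()
```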
arXiv Detail & Related papers (2024-12-17T12:49:14Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - SLCA++: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training [68.7896349660824]
We present an in-depth analysis of the progressive overfitting problem through the lens of sequential fine-tuning (Seq FT).
Considering that overly fast representation learning and a biased classification layer constitute this particular problem, we introduce the advanced Slow Learner with Classifier Alignment (SLCA++) framework.
Our approach involves a Slow Learner to selectively reduce the learning rate of backbone parameters, and a Classifier Alignment to align the disjoint classification layers in a post-hoc fashion.
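A minimal sketch of the Slow Learner idea: assign backbone parameters a much smaller learning rate than the classification head via optimizer parameter groups. The 0.01x ratio and the toy model are illustrative assumptions, and the post-hoc classifier alignment step is not shown.

```python
# Hedged sketch: "slow learner" via per-group learning rates.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),   # stand-in pre-trained backbone
    nn.ReLU(),
    nn.Linear(512, 10),    # task-specific classification head
)
backbone_params = list(model[0].parameters())
head_params = list(model[2].parameters())

base_lr = 1e-3
optimizer = torch.optim.SGD([
    {"params": backbone_params, "lr": base_lr * 0.01},  # slow learner for the backbone
    {"params": head_params, "lr": base_lr},             # normal rate for the head
])
```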
arXiv Detail & Related papers (2024-08-15T17:50:07Z) - Pareto Low-Rank Adapters: Efficient Multi-Task Learning with Preferences [49.14535254003683]
We introduce PaLoRA, a novel parameter-efficient method that addresses multi-task trade-offs in machine learning. Our experiments show that PaLoRA outperforms state-of-the-art MTL and PFL baselines across various datasets.
arXiv Detail & Related papers (2024-07-10T21:25:51Z) - Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process [26.196705232699884]
We introduce Intuitive Fine-Tuning (IFT) to integrate SFT and Preference Optimization into a single process.
IFT performs comparably to, or even better than, sequential recipes of SFT and some typical Preference Optimization methods.
An explainable Frozen Lake game further validates the effectiveness of IFT for obtaining a competitive policy.
arXiv Detail & Related papers (2024-05-20T08:23:28Z) - HFT: Half Fine-Tuning for Large Language Models [42.60438623804577]
One or more fine-tuning phases have become a necessary step to unlock various capabilities of large language models (LLMs).
In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge.
We introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues.
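A rough sketch of the half-fine-tuning idea as summarized above: in each round, pick half of the parameter tensors to train and freeze the rest at their current values. The per-tensor granularity and random selection are assumptions for illustration; HFT's exact partitioning may differ.

```python
# Hedged sketch: train a randomly chosen half of the parameter tensors each
# round and freeze the other half.
import random
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

def select_half_for_training(model: nn.Module, seed: int) -> None:
    params = list(model.parameters())
    rng = random.Random(seed)
    trainable = set(rng.sample(range(len(params)), k=len(params) // 2))
    for i, p in enumerate(params):
        p.requires_grad_(i in trainable)

select_half_for_training(model, seed=0)     # call once per fine-tuning round
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```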
arXiv Detail & Related papers (2024-04-29T07:07:58Z) - Model Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to expedite alignment training with human preferences. We demonstrate that ExPO boosts a DPO model trained with only 20% of the training steps to outperform the fully-trained one. We show that ExPO notably improves existing open-source LLMs on the leading AlpacaEval 2.0 and MT-Bench benchmarks.
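The summary above does not spell out the mechanism; "model extrapolation" is commonly realized as weight-space extrapolation from a weaker checkpoint through a stronger one, and the sketch below assumes that reading with an illustrative alpha. It is not ExPO's verified recipe.

```python
# Hedged sketch: extrapolate past the aligned checkpoint along the
# SFT -> aligned direction in weight space.
import torch

def extrapolate_weights(sft_state: dict, aligned_state: dict, alpha: float = 0.3) -> dict:
    """theta = theta_aligned + alpha * (theta_aligned - theta_sft), per tensor."""
    return {
        name: aligned_state[name] + alpha * (aligned_state[name] - sft_state[name])
        for name in aligned_state
    }

sft = {"w": torch.zeros(2, 2)}
aligned = {"w": torch.ones(2, 2)}
print(extrapolate_weights(sft, aligned)["w"])   # 1.3 everywhere
```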
arXiv Detail & Related papers (2024-04-25T17:39:50Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
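A hedged sketch of one way to realize gradient-based sparse updates: after backpropagation, keep only the top fraction of gradient entries by magnitude in each tensor and zero the rest, so only a sparse set of parameters receives increments. The per-tensor top-k criterion and the 5% density are illustrative assumptions, not necessarily SIFT's exact rule.

```python
# Hedged sketch: keep only the largest-magnitude gradient entries per tensor
# before the optimizer step, so parameter increments are sparse.
import torch
import torch.nn as nn

def sparsify_gradients(model: nn.Module, density: float = 0.05) -> None:
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.flatten()
        k = max(1, int(density * g.numel()))
        threshold = g.abs().topk(k).values.min()            # k-th largest magnitude
        p.grad.mul_((p.grad.abs() >= threshold).to(p.grad.dtype))

model = nn.Linear(128, 128)
loss = model(torch.randn(4, 128)).pow(2).mean()
loss.backward()
sparsify_gradients(model)                                   # call before optimizer.step()
```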
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)