Anchored Supervised Fine-Tuning
- URL: http://arxiv.org/abs/2509.23753v2
- Date: Fri, 03 Oct 2025 08:30:05 GMT
- Title: Anchored Supervised Fine-Tuning
- Authors: He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang, Linyi Yang, Guanhua Chen
- Abstract summary: Post-training of large language models involves a trade-off between supervised fine-tuning and reinforcement learning. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities. We propose Anchored Supervised Fine-Tuning (ASFT) to augment DFT's reweighting with lightweight KL regularization, preserving tightness while ensuring stability.
- Score: 26.17356786243252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide an analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability. To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT's reweighting with lightweight KL regularization to preserve tightness while ensuring stability. Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.
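The abstract pins ASFT down as DFT's token-probability reweighting plus a lightweight KL anchor. As a reading aid, here is a minimal PyTorch-style sketch of such an objective; the function name, the coefficient kl_coef, the detached reweighting factor, and the use of a frozen reference model as the anchor are illustrative assumptions, not the paper's exact construction.

```python
import torch.nn.functional as F

def asft_loss(logits, ref_logits, labels, kl_coef=0.1):
    """Sketch of an ASFT-style objective (hypothetical implementation):
    DFT's probability-reweighted cross-entropy plus a KL anchor to a
    frozen reference model. Padding masks are omitted for brevity.

    logits:     (batch, seq, vocab) from the model being trained
    ref_logits: (batch, seq, vocab) from a frozen anchor model
    labels:     (batch, seq) target token ids
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    # DFT-style reweighting: scale each token's NLL by its detached
    # probability, damping gradients on tokens the model finds unlikely.
    dft_term = -(token_logp.detach().exp() * token_logp).mean()

    # Anchoring: per-position KL(policy || reference), averaged over
    # positions, to counter the distributional drift the abstract describes.
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl_term = (logp.exp() * (logp - ref_logp)).sum(-1).mean()

    return dft_term + kl_coef * kl_term
```

Under this reading, setting kl_coef to zero recovers a DFT-style objective, while the KL term supplies the distributional anchoring that the abstract argues DFT lacks.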
Related papers
- SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning [54.393763477932474]
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). We propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
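As a reading aid for this entry, a hedged sketch of what a selective entropy regularizer could look like; select_mask stands in for the paper's selection mechanism over the token exploration space (not reproduced here), and the names and ent_coef coefficient are illustrative assumptions.

```python
import torch.nn.functional as F

def sed_sft_style_loss(logits, labels, select_mask, ent_coef=0.01):
    """Hypothetical SED-SFT-style objective: standard SFT cross-entropy
    plus an entropy bonus applied only where select_mask is 1.

    select_mask: (batch, seq) 0/1 tensor marking tokens whose diversity
    should be encouraged; the paper's selection criterion is not shown.
    """
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    # Per-token entropy of the predictive distribution.
    entropy = -(logp.exp() * logp).sum(-1)

    # Selective regularization: subtracting entropy on selected positions
    # rewards higher-entropy (more diverse) predictions there.
    return (nll - ent_coef * select_mask * entropy).mean()
```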
arXiv Detail & Related papers (2026-02-07T09:39:21Z)
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
- Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities. We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z)
- Rethinking Expert Trajectory Utilization in LLM Post-training [35.018182540417236]
We propose the Plasticity-Ceiling Framework to ground this landscape. We establish the sequential SFT-then-RL pipeline as the superior standard. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
arXiv Detail & Related papers (2025-12-12T11:13:00Z)
- AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance [5.748208737701793]
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). Recent single-stage methods attempt to unify SFT and RL using principled objectives, but lack a mechanism for dynamically balancing the two paradigms. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward.
arXiv Detail & Related papers (2025-08-09T11:40:54Z)
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [50.30835290642069]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token.
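Isolated from the ASFT sketch above (where it appears as the first term), the DFT reweighting this entry describes could be sketched as follows; the detach on the probability factor is an assumption consistent with a reweighted (rather than squared-probability) objective.

```python
import torch.nn.functional as F

def dft_style_loss(logits, labels):
    """Sketch of DFT-style reweighting as described in this summary:
    each token's negative log-likelihood is rescaled by that token's
    detached probability under the current model."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return -(token_logp.detach().exp() * token_logp).mean()
```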
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
- Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training [36.69514399442043]
This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks.
arXiv Detail & Related papers (2025-07-07T18:17:06Z)
- Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective [98.45690529036848]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear.
arXiv Detail & Related papers (2025-06-30T04:15:01Z)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- Doubly Robust Instance-Reweighted Adversarial Training [107.40683655362285]
We propose a novel doubly robust instance-reweighted adversarial training framework.
Our importance weights are obtained by optimizing the KL-divergence regularized loss function.
Our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance.
arXiv Detail & Related papers (2023-08-01T06:16:18Z)
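One common way to realize the KL-divergence regularized weighting this entry mentions (a generic sketch, not necessarily the paper's doubly robust estimator) is to pick instance weights on the probability simplex that maximize the weighted loss minus a KL penalty toward the uniform distribution; the optimum is a softmax of the per-instance losses.

```python
import torch

def kl_regularized_weights(per_instance_loss: torch.Tensor, tau: float = 1.0):
    """Weights maximizing sum_i w_i * loss_i - tau * KL(w || uniform) over
    the simplex; the closed-form optimum is softmax(loss / tau). A generic
    sketch of KL-regularized instance reweighting, with tau an assumed
    temperature knob."""
    return torch.softmax(per_instance_loss / tau, dim=0)

# Usage: upweight the hardest (highest-loss) adversarial examples.
losses = torch.tensor([0.2, 1.5, 0.9])
weights = kl_regularized_weights(losses, tau=0.5)  # sums to 1
```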
This list is automatically generated from the titles and abstracts of the papers in this site.