Rethinking Expert Trajectory Utilization in LLM Post-training
- URL: http://arxiv.org/abs/2512.11470v1
- Date: Fri, 12 Dec 2025 11:13:00 GMT
- Title: Rethinking Expert Trajectory Utilization in LLM Post-training
- Authors: Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin
- Abstract summary: We propose the Plasticity-Ceiling Framework to ground this landscape. We establish the Sequential SFT-then-RL pipeline as the superior standard. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
- Score: 35.018182540417236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting ``Less is More'' in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
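The abstract's second guideline — transition to RL during the SFT Stable or Mild Overfitting sub-phase rather than at a deeply overfit checkpoint — can be illustrated with a small checkpoint-selection heuristic. This is a minimal sketch, not the paper's method: the function name `pick_rl_transition` and the relative-tolerance rule are assumptions for illustration.

```python
def pick_rl_transition(val_losses, overfit_tolerance=0.02):
    """Pick the SFT epoch at which to switch to RL.

    Returns the last epoch (at or after the validation-loss minimum)
    whose loss is still within `overfit_tolerance` (relative) of that
    minimum, i.e. the end of a hypothetical Stable / Mild Overfitting
    sub-phase, before the loss curve turns sharply upward.
    """
    best = min(val_losses)
    cutoff = best * (1 + overfit_tolerance)
    best_idx = val_losses.index(best)
    idx = best_idx
    # walk forward from the minimum; stop at the first clearly overfit epoch
    for i in range(best_idx + 1, len(val_losses)):
        if val_losses[i] <= cutoff:
            idx = i
        else:
            break
    return idx
```

For example, on the curve `[1.0, 0.8, 0.7, 0.71, 0.9]` the heuristic keeps training through the mildly overfit epoch 3 and hands off to RL there, rather than at the minimum (epoch 2) or the clearly overfit epoch 4.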
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z) - SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning [54.393763477932474]
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). We propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
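A selectively entropy-regularized SFT objective of the kind this abstract describes can be sketched as follows. This is an assumed form based only on the abstract, not SED-SFT's actual objective; the function name, the fixed coefficient `lam`, and the externally supplied `mask` are all illustrative.

```python
import numpy as np

def sed_sft_loss(logits, targets, mask, lam=0.01):
    """Sketch of a selectively entropy-regularized SFT loss (assumed form):
    standard per-token NLL minus an entropy bonus applied only at positions
    selected by `mask` (e.g. tokens with a large exploration space)."""
    # softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    entropy = -(probs * np.log(probs)).sum(axis=-1)
    # entropy bonus is zeroed wherever mask == 0 (the selective masking)
    return float(np.mean(nll - lam * mask * entropy))
```

With `mask` all zeros the objective reduces exactly to standard SFT cross-entropy; non-zero mask entries trade a little likelihood for higher output entropy at those positions.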
arXiv Detail & Related papers (2026-02-07T09:39:21Z) - Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning [8.550698116833123]
Post-training of reasoning LLMs typically consists of an offline SFT stage followed by an online reinforcement learning stage. We show that, after identical RL training, models from stronger SFT checkpoints can significantly underperform those from weaker ones. We propose PEAR, an SFT-stage method that corrects this mismatch and better prepares the model for RL.
arXiv Detail & Related papers (2026-02-01T06:53:45Z) - GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization [9.388803723263392]
We reformulate Supervised Fine-Tuning (SFT) within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline.
arXiv Detail & Related papers (2026-01-14T07:13:57Z) - Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities. We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z) - Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning [30.751908700207185]
SFT plays a crucial role across several scenarios. SFT with only 2K achieves reasoning performance comparable to or better than RL with 20K. We also identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL.
arXiv Detail & Related papers (2025-12-14T13:46:42Z) - Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners [28.039145840787683]
Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT.
arXiv Detail & Related papers (2025-10-06T03:01:14Z) - Anchored Supervised Fine-Tuning [26.17356786243252]
Post-training of large language models involves a trade-off between supervised fine-tuning and reinforcement learning. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities. We propose Anchored Supervised Fine-Tuning (ASFT) to augment DFT's reweighting with lightweight KL regularization, preserving tightness while ensuring stability.
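The combination this abstract describes — DFT-style probability reweighting plus a KL anchor toward a reference model — can be sketched as below. This is an assumed form reconstructed from the abstract alone, not ASFT's published objective; the coefficient `beta` and the function names are illustrative.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def asft_loss(logits, ref_logits, targets, beta=0.1):
    """Anchored-SFT sketch (assumed form): DFT-style probability-weighted
    cross-entropy plus a KL penalty that anchors the policy to a frozen
    reference model's distribution."""
    probs, ref = softmax(logits), softmax(ref_logits)
    p_t = probs[np.arange(len(targets)), targets]
    dft = p_t * (-np.log(p_t))                        # DFT-reweighted NLL
    kl = (probs * np.log(probs / ref)).sum(axis=-1)   # per-position KL(policy || ref)
    return float(np.mean(dft + beta * kl))
```

When the policy matches the reference the KL term vanishes and the loss reduces to the pure DFT-reweighted term, which is the sense in which the anchor only pays a cost as the policy drifts.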
arXiv Detail & Related papers (2025-09-28T08:58:12Z) - RL Fine-Tuning Heals OOD Forgetting in SFT [35.01074051556079]
We investigate the evolution and mechanism behind the synergy of Supervised Fine-Tuning and Reinforcement Learning. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism.
arXiv Detail & Related papers (2025-09-08T21:40:41Z) - Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning [36.06085913761571]
This study introduces a novel method for learning reasoning models that employs bilevel optimization. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process.
arXiv Detail & Related papers (2025-09-08T17:58:02Z) - On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with that token's probability.
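The per-token rescaling this abstract describes can be sketched concretely: each token's cross-entropy is multiplied by that token's own predicted probability (treated as a constant so that, in a real implementation, no gradient flows through the weight). A minimal numpy sketch, assuming this reading of the abstract; the function name `dft_loss` is illustrative.

```python
import numpy as np

def dft_loss(logits, targets):
    """Dynamic Fine-Tuning loss sketch: per-token negative log-likelihood
    rescaled by the probability of the target token. `logits` has shape
    (num_tokens, vocab_size); `targets` holds target token ids."""
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_target = probs[np.arange(len(targets)), targets]
    nll = -np.log(p_target)
    # reweight: confident tokens keep near-full gradient weight,
    # low-probability tokens are down-weighted
    return float(np.mean(p_target * nll))
```

On uniform logits over a 3-token vocabulary, each token has probability 1/3, so the loss is (1/3)·ln 3 instead of the plain cross-entropy ln 3, showing how the reweighting damps updates on uncertain tokens.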
arXiv Detail & Related papers (2025-08-07T17:59:04Z) - AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy [48.30596996677882]
We investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. Our scaling strategies yield notable improvements in reasoning performance. Our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models.
arXiv Detail & Related papers (2025-06-16T09:27:48Z) - Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z) - Understanding Forgetting in LLM Supervised Fine-Tuning and Preference Learning - A Convex Optimization Perspective [55.66517396157806]
The widely adopted approach to post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. This is suboptimal in terms of the SFT and RLHF/DPO trade-off. We propose a practical joint post-training framework that has theoretical convergence guarantees and empirically outperforms the sequential post-training framework.
arXiv Detail & Related papers (2024-10-20T19:38:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.