Trust-Region Adaptive Policy Optimization
- URL: http://arxiv.org/abs/2512.17636v1
- Date: Fri, 19 Dec 2025 14:37:07 GMT
- Title: Trust-Region Adaptive Policy Optimization
- Authors: Mingyu Su, Jian Guan, Yuxian Gu, Minlie Huang, Hongning Wang
- Abstract summary: Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities. We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
- Score: 82.09255251747818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models' (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL's potential for improvements. We address this inefficiency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
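The abstract contrasts forward KL with reverse KL and describes the latter as mode-seeking. The following is a minimal sketch of that general distinction only, not of TrSFT itself: the distributions `p`, `q_cover`, and `q_mode` are made-up illustrative values, and `kl` is a hypothetical helper, none of which come from the paper.

```python
import math


def kl(p, q):
    """Discrete KL(p || q) for two equal-length probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical bimodal "expert" distribution over five outcomes.
p = [0.45, 0.04, 0.02, 0.04, 0.45]

# Two candidate model distributions (illustrative values only):
q_cover = [0.20, 0.20, 0.20, 0.20, 0.20]  # spreads mass over both modes
q_mode = [0.92, 0.02, 0.02, 0.02, 0.02]   # commits to a single mode

# Forward KL(p || q) penalizes q for missing mass where p is large,
# so it favors the mode-covering candidate.
forward_cover = kl(p, q_cover)
forward_mode = kl(p, q_mode)

# Reverse KL(q || p) penalizes q for placing mass where p is small,
# so it favors the mode-seeking candidate.
reverse_cover = kl(q_cover, p)
reverse_mode = kl(q_mode, p)

print(f"forward KL: cover={forward_cover:.3f}  mode={forward_mode:.3f}")
print(f"reverse KL: cover={reverse_cover:.3f}  mode={reverse_mode:.3f}")
```

With these values, forward KL ranks the spread-out candidate as closer to `p`, while reverse KL ranks the single-mode candidate as closer, which is the sense in which reverse-KL-like updates are called mode-seeking.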
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z)
- SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning [54.393763477932474]
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). We propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
arXiv Detail & Related papers (2026-02-07T09:39:21Z)
- Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning [8.550698116833123]
Post-training of reasoning LLMs typically consists of an offline SFT stage followed by an online reinforcement learning stage. We show that, after identical RL training, models from stronger SFT checkpoints can significantly underperform those from weaker ones. We propose PEAR, an SFT-stage method that corrects this mismatch and better prepares the model for RL.
arXiv Detail & Related papers (2026-02-01T06:53:45Z)
- Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration [56.074760766965085]
PRISM achieves a dynamics-aware framework that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
arXiv Detail & Related papers (2026-01-12T05:43:20Z)
- Rethinking Expert Trajectory Utilization in LLM Post-training [35.018182540417236]
We propose the Plasticity-Ceiling Framework to ground this landscape. We establish the Sequential SFT-then-RL pipeline as the superior standard. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
arXiv Detail & Related papers (2025-12-12T11:13:00Z)
- Anchored Supervised Fine-Tuning [26.17356786243252]
Post-training of large language models involves a trade-off between supervised fine-tuning and reinforcement learning. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities. We propose Anchored Supervised Fine-Tuning (ASFT) to augment DFT's reweighting with lightweight KL regularization to preserve tightness while ensuring stability.
arXiv Detail & Related papers (2025-09-28T08:58:12Z)
- Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning [36.06085913761571]
This study introduces a novel method for learning reasoning models that employs bilevel optimization. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process.
arXiv Detail & Related papers (2025-09-08T17:58:02Z)
- AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance [7.685078284407324]
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward.
arXiv Detail & Related papers (2025-08-09T11:40:54Z)
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [66.17068546293487]
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
arXiv Detail & Related papers (2025-07-10T09:05:49Z)
- Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective [98.45690529036848]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear.
arXiv Detail & Related papers (2025-06-30T04:15:01Z)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- On the Power of Perturbation under Sampling in Solving Extensive-Form Games [56.013335390600524]
We investigate how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in solving extensive-form games under sampling. We present a unified framework for Perturbed FTRL algorithms and study two variants: PFTRL-KL and PFTRL-RKL.
arXiv Detail & Related papers (2025-01-28T00:29:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.