AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
- URL: http://arxiv.org/abs/2508.06944v2
- Date: Tue, 12 Aug 2025 03:23:43 GMT
- Title: AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
- Authors: Lixuan He, Jie Feng, Yong Li
- Abstract summary: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward.
- Score: 5.748208737701793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and an analysis of training dynamics confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
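The idea of treating the SFT-RL balance as a learnable parameter can be illustrated with a toy sketch. This is not the authors' implementation: the scalar parameter `theta`, the quadratic stand-ins for the SFT loss, RL loss, and validation utility, the one-step meta-gradient, and every name and hyperparameter below are assumptions made for illustration. The entropy regularizer mentioned in the abstract is omitted here.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def meta_step(theta, mu, *, sft_target, rl_target, val_target,
              lr=0.1, meta_lr=0.5, mu_bounds=(0.05, 0.95)):
    """One inner update of theta, then one meta-gradient update of mu.

    Toy assumptions (hypothetical, not from the paper):
      L_sft(theta) = (theta - sft_target)^2
      L_rl(theta)  = (theta - rl_target)^2
      U(theta)     = -(theta - val_target)^2   (validation utility)
    """
    # Gradients of the two toy losses at the current theta.
    g_sft = 2.0 * (theta - sft_target)
    g_rl = 2.0 * (theta - rl_target)

    # Inner step on the mixed objective mu * L_sft + (1 - mu) * L_rl.
    theta_new = theta - lr * (mu * g_sft + (1.0 - mu) * g_rl)

    # Differentiate the utility after the step with respect to mu:
    # d(theta_new)/d(mu) = -lr * (g_sft - g_rl), then apply the chain rule.
    dU_dtheta = -2.0 * (theta_new - val_target)
    dU_dmu = dU_dtheta * (-lr * (g_sft - g_rl))

    # Gradient ascent on the utility, keeping mu inside its bounds.
    mu_new = clip(mu + meta_lr * dU_dmu, *mu_bounds)
    return theta_new, mu_new


def run(steps=200):
    """Iterate the toy controller from theta = 0 and mu = 0.5."""
    theta, mu = 0.0, 0.5
    for _ in range(steps):
        theta, mu = meta_step(theta, mu, sft_target=1.0,
                              rl_target=3.0, val_target=2.5)
    return theta, mu


if __name__ == "__main__":
    print(run())
```

In this toy setting the weight is not scheduled by hand: it moves in whichever direction improves the utility measured after the parameter update, which is the forward-looking behavior the abstract attributes to the controller. With the targets above, the balance point is `theta = 2.5`, which the mixed objective reaches when `mu = 0.25`.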
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z)
- Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training [61.1421888242439]
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). We propose a framework to bridge this chasm by enabling On-Policy SFT.
arXiv Detail & Related papers (2026-02-12T17:59:58Z)
- Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities. We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z)
- Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS [62.22644307952087]
We introduce AIRL-S, the first natural unification of RL-based and search-based TTS. We leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces. Our results show that our unified approach improves performance by 9% on average over the base model, matching GPT-4o.
arXiv Detail & Related papers (2025-08-19T23:41:15Z)
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [50.30835290642069]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [66.17068546293487]
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
arXiv Detail & Related papers (2025-07-10T09:05:49Z)
- Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling [35.64557242726578]
Prefix-RFT is a hybrid approach that synergizes learning from both demonstration and exploration. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods.
arXiv Detail & Related papers (2025-07-02T13:04:09Z)
- Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably [80.36077974826865]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. We study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge.
arXiv Detail & Related papers (2025-06-30T04:15:01Z)
- SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning [20.442971494407896]
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. We propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.
arXiv Detail & Related papers (2025-06-24T16:31:37Z)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z)
- Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs [13.292104357930866]
SASR is a step-wise adaptive hybrid training framework for large language models. It unifies SFT and RL and dynamically balances the two throughout optimization. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.
arXiv Detail & Related papers (2025-05-19T12:10:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.