Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
- URL: http://arxiv.org/abs/2506.23508v2
- Date: Fri, 26 Sep 2025 05:33:22 GMT
- Title: Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
- Authors: Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen
- Abstract summary: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear.
- Score: 98.45690529036848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on the open-source Qwen2.5-VL series of multimodal models. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but maintains prior knowledge. We study this phenomenon through learning dynamics, examining both the magnitude and the direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples that are naturally aligned with the base model's probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a small magnitude of influence and are well aligned in direction with prior knowledge, allows SFT to preserve prior knowledge better while rapidly learning new tasks. These findings suggest that the distribution of training data, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.
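The "magnitude and direction" of a training sample's influence described above can be made concrete with a first-order learning-dynamics probe: the norm of the sample's gradient gives the magnitude of the update it induces, and its cosine similarity with the gradient of a prior-knowledge example indicates whether the update helps or interferes with that knowledge. The sketch below illustrates this idea on a toy classifier; it is a generic probe under these assumptions, not the authors' exact measurement protocol.

```python
# Hypothetical sketch: first-order view of how one gradient step on a new-task
# sample perturbs the loss on a prior-knowledge sample. The toy model stands
# in for the MLLM being fine-tuned.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

def flat_grad(x, y):
    """Gradient of the cross-entropy loss on (x, y), flattened into one vector."""
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Toy "new-task" sample and "prior-knowledge" sample.
x_task, y_task = torch.randn(1, 16), torch.tensor([2])
x_prior, y_prior = torch.randn(1, 16), torch.tensor([0])

g_task = flat_grad(x_task, y_task)
g_prior = flat_grad(x_prior, y_prior)

magnitude = g_task.norm().item()                                # how hard the sample pushes the weights
direction = F.cosine_similarity(g_task, g_prior, dim=0).item()  # alignment with the prior-knowledge gradient

# First-order change in the prior-knowledge loss after one SGD step of size lr:
# delta_L_prior ~= -lr * <g_task, g_prior>  (negative = prior knowledge improves).
lr = 1e-2
predicted_delta = (-lr * torch.dot(g_task, g_prior)).item()

print(f"magnitude (grad norm): {magnitude:.4f}")
print(f"direction (cosine to prior grad): {direction:.4f}")
print(f"predicted change in prior-knowledge loss: {predicted_delta:.6f}")
```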
Related papers
- Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models [56.12341509545198]
Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently. We theoretically analyze transformers trained on an in-context weight prediction task for linear regression.
arXiv Detail & Related papers (2026-03-01T21:58:09Z)
- Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning [30.751908700207185]
SFT plays a crucial role across several scenarios. SFT with only 2K samples achieves reasoning performance comparable to or better than RL with 20K. We identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL.
arXiv Detail & Related papers (2025-12-14T13:46:42Z)
- Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking [56.46309219272326]
For large language models (LLMs), classification via supervised fine-tuning (SFT) predicts a "yes" (resp. "no") token for relevant (resp. irrelevant) pairs. This divergence between SFT-style classification and contrastive learning (CL) raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? We conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking universal multimodal retrieval (UMR) as the experimental playground.
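The SFT-as-classification objective mentioned above can be illustrated with a small scoring function: the model is prompted with a query-document pair and the relevance score is read off the next-token logits for "yes" versus "no". The sketch below is a text-only simplification with an illustrative model name and prompt template, not the paper's actual setup.

```python
# Hedged sketch of pointwise "yes"/"no" reranking with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small instruction-tuned causal LM
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)
lm.eval()

def relevance_score(query: str, doc: str) -> float:
    prompt = (f"Query: {query}\nDocument: {doc}\n"
              "Is the document relevant to the query? Answer yes or no: ")
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = lm(**inputs).logits[0, -1]
    # Use the first sub-token of " yes" / " no" as the class tokens.
    yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" no", add_special_tokens=False).input_ids[0]
    return (next_token_logits[yes_id] - next_token_logits[no_id]).item()  # higher = more relevant

docs = ["Jigsaw puzzles test spatial reasoning.", "A recipe for sourdough bread."]
ranked = sorted(docs, key=lambda d: relevance_score("visual puzzle tasks", d), reverse=True)
print(ranked)
```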
arXiv Detail & Related papers (2025-10-16T16:02:27Z)
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [66.17068546293487]
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
arXiv Detail & Related papers (2025-07-10T09:05:49Z)
- Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training [23.99424961055015]
This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks.
arXiv Detail & Related papers (2025-07-07T18:17:06Z)
- Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling [35.64557242726578]
Prefix-RFT is a hybrid approach that synergizes learning from both demonstration and exploration. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods.
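Prefix sampling, as named in the title above, can be pictured as follows: keep part of a demonstration as a fixed prefix and let the policy complete the rest, so one rollout mixes imitation (the prefix) with exploration (the sampled continuation). The sketch below is a toy illustration with placeholder functions; it is not Prefix-RFT's actual interface.

```python
# Toy sketch of prefix sampling: demonstration prefix + policy continuation,
# scored by an RFT-style reward. All names here are placeholders.
import random

def prefix_rollout(policy_generate, demonstration_tokens, reward_fn,
                   min_frac=0.2, max_frac=0.8):
    # Keep a random fraction of the demonstration as the fixed prefix.
    frac = random.uniform(min_frac, max_frac)
    cut = max(1, int(len(demonstration_tokens) * frac))
    prefix = demonstration_tokens[:cut]

    continuation = policy_generate(prefix)   # the policy explores the remainder
    full = prefix + continuation
    return full, reward_fn(full)             # reward the completed sequence

# Toy usage with stand-ins for the policy and the reward function.
demo = list("2+2=4 because 2 and 2 make 4")
fake_policy = lambda prefix: list(" ... model continuation mentioning 4")
fake_reward = lambda seq: float("4" in "".join(seq))
sequence, reward = prefix_rollout(fake_policy, demo, fake_reward)
print(len(sequence), reward)
```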
arXiv Detail & Related papers (2025-07-02T13:04:09Z)
- Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality [22.105116314320696]
Supervised fine-tuning (SFT) is a critical step in aligning large language models with human instructions and values. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT.
arXiv Detail & Related papers (2025-06-17T16:13:15Z)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
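For context, the implicit reward that typically underlies this SFT-DPO connection is the standard DPO-style quantity r(x, y) = β · log(π_θ(y | x) / π_ref(y | x)); under this reading, raising the likelihood of a demonstrated response during SFT also raises its implicit reward relative to the reference policy π_ref. This is the common formulation from the DPO literature, noted here for orientation rather than as the paper's exact framework.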
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting [1.5595148909011116]
Supervised Fine-Tuning (SFT) is a critical step for enhancing the instruction-following capabilities of Large Language Models (LLMs). However, SFT often leads to a degradation of the model's general abilities, a phenomenon known as catastrophic forgetting. We propose a novel and cost-effective SFT method that effectively mitigates catastrophic forgetting without requiring access to the original SFT data.
arXiv Detail & Related papers (2025-06-11T06:23:50Z)
- UFT: Unifying Supervised and Reinforcement Fine-Tuning [21.195897792629548]
We propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals. We theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck.
arXiv Detail & Related papers (2025-05-22T17:53:57Z)
- Efficient Reinforcement Finetuning via Adaptive Curriculum Learning [24.52451100497884]
Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs). AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals. AdaRFT reduces training time by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
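The reward-driven difficulty adjustment described above can be sketched as a small feedback loop: if recent rollouts are mostly rewarded, sample harder problems; if they mostly fail, sample easier ones. The thresholds and update rule below are illustrative defaults, not AdaRFT's published hyperparameters.

```python
# Hedged sketch of a reward-driven curriculum controller.
from collections import deque

class AdaptiveCurriculum:
    def __init__(self, step=0.05, window=64, low=0.3, high=0.7):
        self.target_difficulty = 0.5         # in [0, 1]; problems are sampled near this level
        self.rewards = deque(maxlen=window)  # recent rollout rewards
        self.step, self.low, self.high = step, low, high

    def record(self, reward: float) -> None:
        self.rewards.append(reward)

    def update(self) -> float:
        if not self.rewards:
            return self.target_difficulty
        avg = sum(self.rewards) / len(self.rewards)
        if avg > self.high:        # problems are too easy: push difficulty up
            self.target_difficulty = min(1.0, self.target_difficulty + self.step)
        elif avg < self.low:       # problems are too hard: ease off
            self.target_difficulty = max(0.0, self.target_difficulty - self.step)
        return self.target_difficulty

curriculum = AdaptiveCurriculum()
for r in [1, 1, 1, 0, 1, 1, 1, 1]:  # mostly solved rollouts -> difficulty increases
    curriculum.record(r)
print(curriculum.update())
```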
arXiv Detail & Related papers (2025-04-07T21:31:31Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns [47.57912649802414]
We study how the SFT process adapts LLMs to downstream tasks from the perspective of attention patterns.
We find that (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples.
arXiv Detail & Related papers (2024-09-24T07:34:50Z)
- A Closer Look at the Limitations of Instruction Tuning [52.587607091917214]
We show that Instruction Tuning (IT) fails to enhance knowledge or skills in large language models (LLMs).
We also show that popular methods to improve IT do not lead to performance improvements over a simple LoRA fine-tuned model.
Our findings reveal that responses generated solely from pre-trained knowledge consistently outperform responses by models that learn any form of new knowledge from IT on open-source datasets.
arXiv Detail & Related papers (2024-02-03T04:45:25Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- Instruction Tuning for Large Language Models: A Survey [52.86322823501338]
We provide a systematic review of the literature, including the general methodology of supervised fine-tuning (SFT). We also review the potential pitfalls of SFT and criticism against it, together with efforts pointing out the deficiencies of existing strategies.
arXiv Detail & Related papers (2023-08-21T15:35:16Z)
- Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
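The supervised pretraining signal described here (DPT) can be pictured as a sequence model that reads a small in-context dataset of interactions plus a query state and is trained to output the optimal action for that query. The sketch below is a toy stand-in with arbitrary shapes and a placeholder label source, not the paper's implementation.

```python
# Hedged sketch of a DPT-style training step: predict the optimal action for a
# query state from an in-context dataset of (state, action, reward) triples.
import torch
import torch.nn as nn

n_actions, d_model, ctx_len = 4, 32, 16

class ToyDPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(3, d_model)            # one (state, action, reward) triple per context step
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_actions)     # logits over actions for the query state

    def forward(self, context, query_state):
        # context: [B, ctx_len, 3]; query_state: [B, 1, 3] with action/reward slots zeroed
        h = self.encoder(self.embed(torch.cat([context, query_state], dim=1)))
        return self.head(h[:, -1])                    # read the prediction off the query position

model = ToyDPT()
context = torch.randn(8, ctx_len, 3)
query = torch.cat([torch.randn(8, 1, 1), torch.zeros(8, 1, 2)], dim=-1)
optimal_action = torch.randint(0, n_actions, (8,))    # label from a task-specific oracle/planner (placeholder)
loss = nn.functional.cross_entropy(model(context, query), optimal_action)
loss.backward()
print(float(loss))
```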
arXiv Detail & Related papers (2023-06-26T17:58:50Z)
- A Survey on Deep Learning based Time Series Analysis with Frequency Transformation [75.63783789488471]
Frequency transformation (FT) has been increasingly incorporated into deep learning models to enhance state-of-the-art accuracy and efficiency in time series analysis. Despite the growing attention and the proliferation of research in this emerging field, there is currently a lack of a systematic review and in-depth analysis of deep learning-based time series models with FT. We present a comprehensive review that systematically investigates and summarizes the recent research advancements in deep learning-based time series analysis with FT.
arXiv Detail & Related papers (2023-02-04T14:33:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.