Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably
- URL: http://arxiv.org/abs/2506.23508v1
- Date: Mon, 30 Jun 2025 04:15:01 GMT
- Title: Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably
- Authors: Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang
- Abstract summary: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. We study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge.
- Score: 80.36077974826865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model's probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.
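The following is a minimal, self-contained sketch (an assumption for illustration, not the authors' implementation) of the kind of setup the abstract describes: a jigsaw-puzzle task built by cutting an image into a grid of tiles, shuffling them, and scoring a response with a verifiable binary reward over the ground-truth permutation. Such a reward is the signal an RFT loop could reinforce, and correct rollouts of this kind could also be reused as SFT targets; the grid size, tile layout, and reward definition here are all assumed.

```python
# Minimal sketch of a verifiable jigsaw task (assumed setup, not the paper's code).
import numpy as np

def make_jigsaw(image, grid=2, seed=0):
    """Split `image` (H, W) into grid*grid tiles and shuffle them."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    tiles = [image[r*h:(r+1)*h, c*w:(c+1)*w] for r in range(grid) for c in range(grid)]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(tiles))           # ground-truth permutation
    shuffled = [tiles[i] for i in perm]
    return shuffled, perm

def reward(predicted_perm, true_perm):
    """Binary verifiable reward: 1.0 only if the full permutation is recovered."""
    return float(np.array_equal(predicted_perm, true_perm))

# Toy usage: a 4x4 "image", a 2x2 jigsaw, and two candidate predictions.
image = np.arange(16).reshape(4, 4)
shuffled_tiles, true_perm = make_jigsaw(image, grid=2, seed=0)
print("true permutation:", true_perm)
print("reward for correct guess:", reward(true_perm, true_perm))
print("reward for identity guess:", reward(np.arange(4), true_perm))
```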
Related papers
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [66.17068546293487]
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
arXiv Detail & Related papers (2025-07-10T09:05:49Z) - Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training [23.99424961055015]
This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks.
arXiv Detail & Related papers (2025-07-07T18:17:06Z) - Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling [35.64557242726578]
Prefix-RFT is a hybrid approach that synergizes learning from both demonstration and exploration. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods.
arXiv Detail & Related papers (2025-07-02T13:04:09Z) - Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z) - Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting [1.5595148909011116]
Supervised Fine-Tuning (SFT) is a critical step for enhancing the instruction-following capabilities of Large Language Models (LLMs). SFT often leads to a degradation of the model's general abilities, a phenomenon known as catastrophic forgetting. We propose a novel and cost-effective SFT method that effectively mitigates catastrophic forgetting without requiring access to the original SFT data.
arXiv Detail & Related papers (2025-06-11T06:23:50Z) - UFT: Unifying Supervised and Reinforcement Fine-Tuning [21.195897792629548]
We propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals. We theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck.
arXiv Detail & Related papers (2025-05-22T17:53:57Z) - Efficient Reinforcement Finetuning via Adaptive Curriculum Learning [24.52451100497884]
Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs). AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals (a minimal sketch of such an adaptive curriculum is given after this list). AdaRFT reduces training time by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
arXiv Detail & Related papers (2025-04-07T21:31:31Z) - OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z) - Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns [47.57912649802414]
We study how the SFT process adapts LLMs to downstream tasks from the perspective of attention patterns.
We find that (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples.
arXiv Detail & Related papers (2024-09-24T07:34:50Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - Instruction Tuning for Large Language Models: A Survey [52.86322823501338]
We present a systematic review of the literature, including the general methodology of supervised fine-tuning (SFT). We also review the potential pitfalls of SFT and criticism against it, along with efforts pointing out current deficiencies of existing strategies.
arXiv Detail & Related papers (2023-08-21T15:35:16Z) - A Survey on Deep Learning based Time Series Analysis with Frequency Transformation [75.63783789488471]
Frequency transformation (FT) has been increasingly incorporated into deep learning models to enhance state-of-the-art accuracy and efficiency in time series analysis. Despite the growing attention and the proliferation of research in this emerging field, there is currently a lack of a systematic review and in-depth analysis of deep learning-based time series models with FT. We present a comprehensive review that systematically investigates and summarizes the recent research advancements in deep learning-based time series analysis with FT.
arXiv Detail & Related papers (2023-02-04T14:33:07Z)
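As referenced in the AdaRFT entry above, the following is a minimal sketch of an adaptive curriculum for reinforcement finetuning: problems are sampled near a target difficulty, and the target rises when recent rewards are high and falls when they are low. The scheduling rule, thresholds, and the simulated rollout are assumptions for illustration, not the AdaRFT implementation.

```python
# Minimal sketch of reward-driven difficulty scheduling (assumed, not AdaRFT's code).
import random

def adaptive_curriculum(problem_difficulties, steps=200, target=0.3,
                        step_size=0.05, reward_threshold=0.5, window=10):
    recent_rewards = []
    for _ in range(steps):
        # pick the problem whose difficulty is closest to the current target
        problem = min(problem_difficulties, key=lambda d: abs(d - target))
        # simulated rollout: easier problems are solved more often
        reward = 1.0 if random.random() > problem else 0.0
        recent_rewards = (recent_rewards + [reward])[-window:]
        # move the target difficulty up or down based on the recent average reward
        avg = sum(recent_rewards) / len(recent_rewards)
        target += step_size if avg > reward_threshold else -step_size
        target = min(max(target, 0.0), 1.0)
    return target

difficulties = [i / 20 for i in range(21)]   # toy pool of problem difficulties in [0, 1]
print("final target difficulty:", adaptive_curriculum(difficulties))
```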