The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
- URL: http://arxiv.org/abs/2507.07562v1
- Date: Thu, 10 Jul 2025 09:05:49 GMT
- Title: The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
- Authors: Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu
- Abstract summary: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
- Score: 66.17068546293487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-stage, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
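For concreteness, here is a minimal sketch of one of the combination strategies named in the abstract, model merging, implemented as linear weight interpolation between an SFT checkpoint and an RL checkpoint. The function name `merge_state_dicts` and the coefficient `alpha` are illustrative assumptions; the paper does not specify its exact merging recipe.

```python
# Minimal sketch of model merging via linear weight interpolation.
# Illustrative assumption only: the paper studies model merging as one
# combination strategy but does not publish its exact recipe here.
from typing import Dict
import torch


def merge_state_dicts(
    sft_state: Dict[str, torch.Tensor],
    rl_state: Dict[str, torch.Tensor],
    alpha: float = 0.5,  # weight on the SFT checkpoint (assumed hyperparameter)
) -> Dict[str, torch.Tensor]:
    """Linearly interpolate two checkpoints with identical architectures."""
    assert sft_state.keys() == rl_state.keys(), "checkpoints must match"
    return {
        name: alpha * sft_state[name] + (1.0 - alpha) * rl_state[name]
        for name in sft_state
    }


if __name__ == "__main__":
    # Toy demonstration with two randomly initialized "checkpoints".
    sft = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.randn(4)}
    rl = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.randn(4)}
    merged = merge_state_dicts(sft, rl, alpha=0.5)
    print(merged["layer.weight"].shape)  # torch.Size([4, 4])
```

Per the abstract, even a simple strategy like this trades off accuracy, reasoning style, and response length rather than adding the two methods' gains.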
Related papers
- VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning [69.44871115752055]
We propose an advanced multimodal reasoning model trained via a novel Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft-weighting mechanism, which dynamically adjusts training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity (a hedged sketch of one possible length reward follows this entry).
arXiv Detail & Related papers (2025-07-30T12:23:21Z)
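The summary above does not give VL-Cogito's actual length-reward formula, so the following is a hedged sketch: a Gaussian-shaped bonus around a difficulty-dependent target length. Both `target_len_for` and the bandwidth `sigma` are illustrative assumptions, not the paper's formulation.

```python
import math


def target_len_for(difficulty: float, min_len: int = 64, max_len: int = 1024) -> int:
    """Map a difficulty score in [0, 1] to a target reasoning length (assumed)."""
    return int(min_len + difficulty * (max_len - min_len))


def length_reward(response_len: int, difficulty: float, sigma: float = 128.0) -> float:
    """Gaussian bonus that peaks when the response length matches the target.

    Illustrative stand-in for a dynamic length reward, not VL-Cogito's actual one.
    """
    target = target_len_for(difficulty)
    return math.exp(-((response_len - target) ** 2) / (2.0 * sigma**2))


if __name__ == "__main__":
    # An easy question is rewarded for brevity...
    print(round(length_reward(80, difficulty=0.0), 3))   # close to 1.0
    print(round(length_reward(900, difficulty=0.0), 3))  # near zero
    # ...while a hard one tolerates a long reasoning chain.
    print(round(length_reward(900, difficulty=0.9), 3))  # close to 1.0
```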
- Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably [80.36077974826865]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. We study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge.
arXiv Detail & Related papers (2025-06-30T04:15:01Z)
- Empowering Small VLMs to Think with Dynamic Memorization and Exploration [5.2613925143497635]
Equipping small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM. We propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step (a hedged sketch of such per-step mode switching follows this entry).
arXiv Detail & Related papers (2025-06-29T02:19:51Z)
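DyME's actual selection criterion is not given in the summary above, so the sketch below assumes a simple stand-in rule: explore (RLVR) when recent verifiable accuracy is high enough to make on-policy rollouts informative, otherwise memorize (SFT). The threshold, window size, and function names are all assumptions.

```python
import random
from typing import Callable, List


def choose_mode(recent_accuracies: List[float], threshold: float = 0.3) -> str:
    """Return 'rlvr' or 'sft' for the next optimization step (assumed rule)."""
    if not recent_accuracies:
        return "sft"  # cold start: imitate demonstrations first
    avg = sum(recent_accuracies) / len(recent_accuracies)
    return "rlvr" if avg >= threshold else "sft"


def train(num_steps: int, verify: Callable[[], float]) -> None:
    history: List[float] = []
    for step in range(num_steps):
        mode = choose_mode(history[-16:])  # look at the last 16 steps
        if mode == "sft":
            pass  # one SFT step on curated long-CoT data would go here
        else:
            pass  # one RLVR step with a verifiable reward would go here
        history.append(verify())  # verified-rollout accuracy at this step


if __name__ == "__main__":
    random.seed(0)
    train(num_steps=100, verify=lambda: random.random())
```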
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning [6.92510069380188]
We investigate the dynamics between SFT and RL on eight reasoning tasks. Short CoT sequences used in SFT as a warm-up make a moderate contribution to RL training compared with cold-start RL. We find that longer CoT sequences with backtracking generally induce better and more stable RL training.
arXiv Detail & Related papers (2025-05-30T06:49:00Z)
- TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning (a sketch of one possible think-answer consistency reward follows this entry).
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
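The name "think-answer consistency" suggests rewarding rollouts whose final answer agrees with the conclusion reached inside the thinking trace. The sketch below assumes `<think>`/`<answer>` tags and a last-number extraction rule; TACO's actual reward is not specified in the summary above.

```python
import re
from typing import Optional

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def trace_conclusion(think_text: str) -> Optional[str]:
    """Take the final numeric value in the trace as its conclusion (assumed)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", think_text)
    return numbers[-1] if numbers else None


def consistency_reward(rollout: str) -> float:
    """1.0 if the answer matches the trace's conclusion, else 0.0 (assumed scoring)."""
    think = THINK_RE.search(rollout)
    answer = ANSWER_RE.search(rollout)
    if not think or not answer:
        return 0.0  # malformed rollout gets no reward
    conclusion = trace_conclusion(think.group(1))
    return 1.0 if conclusion == answer.group(1).strip() else 0.0


if __name__ == "__main__":
    good = "<think>2 + 2 gives 4</think><answer>4</answer>"
    bad = "<think>2 + 2 gives 4</think><answer>5</answer>"
    print(consistency_reward(good), consistency_reward(bad))  # 1.0 0.0
```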
- Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs [13.292104357930866]
SASR is a step-wise adaptive hybrid training framework for large language models. It unifies SFT and RL and dynamically balances the two throughout optimization. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.
arXiv Detail & Related papers (2025-05-19T12:10:17Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs). We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning (a sketch of such an iterative SFT-RL loop follows this entry).
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
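The summary above describes an iterative interleaving of SFT and RL. Below is a hedged sketch of such an outer loop, where each round distills the current policy's verifier-accepted rollouts into the next SFT set. The round structure, the filtering rule, and all function names are illustrative assumptions, not OpenVLThinker's published recipe.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, reasoning trace)


def self_improve(
    rounds: int,
    seed_data: List[Example],
    sft_step: Callable[[List[Example]], None],
    rl_step: Callable[[], None],
    sample_verified: Callable[[], List[Example]],
) -> None:
    """Alternate SFT and RL, refreshing the distillation set each round."""
    data = list(seed_data)
    for r in range(rounds):
        sft_step(data)                # imitate the current distillation set
        rl_step()                     # then refine the policy with RL
        verified = sample_verified()  # keep only rollouts a verifier accepts
        data = verified or data       # next round trains on fresher traces
        print(f"round {r}: {len(data)} distilled examples")


if __name__ == "__main__":
    # Dummy callbacks so the sketch runs end to end.
    self_improve(
        rounds=3,
        seed_data=[("q0", "t0")],
        sft_step=lambda d: None,
        rl_step=lambda: None,
        sample_verified=lambda: [("q1", "t1"), ("q2", "t2")],
    )
```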