Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
- URL: http://arxiv.org/abs/2507.16814v1
- Date: Tue, 22 Jul 2025 17:59:34 GMT
- Title: Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
- Authors: Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
- Abstract summary: This paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model. Experiments with InternVL2.5 and InternVL3.0 at 8B and 38B scales show the effectiveness of SOPHIA.
- Score: 50.748123348417984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to use on-policy reinforcement learning (RL) to develop slow-thinking ability, because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns slow-thinking reasoning from the obtained reasoning trajectories and propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at 8B and 38B scales show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows that SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
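The abstract describes a concrete generate-and-train loop: the trainable LVLM supplies visual understanding on-policy, an external language model supplies slow-thinking reasoning off-policy, outcome rewards are assigned to the reasoning and propagated back to the visual step, and the LVLM is then trained on the resulting trajectories. The sketch below is one hedged reading of that loop; `lvlm`, `reasoner_lm`, their methods, and the trivial reward propagation are illustrative assumptions, not the paper's actual interfaces or reward scheme.

```python
# Minimal sketch of a SOPHIA-style semi-off-policy data-collection loop,
# following the steps named in the abstract. All objects and method names
# here are hypothetical placeholders.

def collect_semi_off_policy_trajectories(lvlm, reasoner_lm, batch):
    """Behavior model = on-policy visual understanding (trainable LVLM)
    + off-policy slow-thinking reasoning (external language model)."""
    trajectories = []
    for image, question, answer in batch:
        # On-policy part: the trainable LVLM produces the visual understanding.
        description = lvlm.describe_image(image, question)

        # Off-policy part: an external LM performs slow-thinking reasoning
        # from the text description alone, without ever seeing the image.
        reasoning, prediction = reasoner_lm.solve(question, description)

        # Outcome-based reward for the reasoning trajectory ...
        r_outcome = 1.0 if prediction == answer else 0.0
        # ... propagated backward as a visual reward for the description step
        # (copied verbatim here; the paper's propagation scheme is richer).
        r_visual = r_outcome

        trajectories.append({
            "image": image, "question": question, "description": description,
            "reasoning": reasoning, "reward_visual": r_visual,
            "reward_outcome": r_outcome,
        })
    return trajectories
```

The collected trajectories and propagated rewards would then drive an off-policy RL update of the LVLM, which the abstract reports serves as a better policy initialization than SFT or direct on-policy RL.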
Related papers
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs. We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
arXiv Detail & Related papers (2025-07-08T23:22:34Z)
- Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning [17.421901873720156]
This paper proposes a novel RL framework called Vision-EKIPL. It introduces high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. It achieves up to a 5% performance improvement on the Reason-RFT-CoT Benchmark compared to the state-of-the-art (SOTA).
arXiv Detail & Related papers (2025-06-07T16:37:46Z)
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z)
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z)
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs). We show that SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs). We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback [16.24562885483636]
We propose an innovative method to align modalities in Large Vision-Language Models (LVLMs) through Fine-Grained Artificial Intelligence Feedback (FGAIF). Specifically, we first utilize AI tools to predict the types of hallucination for each segment in the response and obtain a collection of fine-grained feedback. Then, based on the collected reward data, three specialized reward models are trained to produce dense rewards. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm.
arXiv Detail & Related papers (2024-04-07T19:00:45Z)
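The FGAIF summary names a concrete recipe: segment-level hallucination feedback, three specialized reward models, and dense rewards fed into PPO. The fragment below is a hedged illustration of the dense-reward step only; the reward-model interface, the equal weighting, and the `ppo_step` placeholder are assumptions rather than the paper's API.

```python
# Hedged sketch of FGAIF-style dense reward shaping. The reward-model
# interface and the combination weights are illustrative assumptions.

def dense_rewards(response_segments, reward_models, weights=(1.0, 1.0, 1.0)):
    """Score each response segment with three specialized reward models
    (e.g. for object, attribute, and relation hallucinations) and combine
    the scores into one dense, per-segment reward."""
    rewards = []
    for seg in response_segments:
        scores = [rm.score(seg) for rm in reward_models]
        rewards.append(sum(w * s for w, s in zip(weights, scores)))
    return rewards  # one scalar per segment instead of one per response

# Usage inside a PPO loop (ppo_step stands in for any PPO implementation):
#   segments = split_into_segments(response)
#   r = dense_rewards(segments, [object_rm, attribute_rm, relation_rm])
#   ppo_step(policy, prompt, response, segment_rewards=r)
```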
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
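The LINVIT entry above describes using LLM guidance as a regularization factor in value-based RL. As one hedged reading of that idea (not LINVIT's actual objective), a TD loss can be augmented with a term that pulls the Q-induced policy toward an LLM-suggested action distribution; the loss form, the KL direction, and the coefficient `lam` below are assumptions.

```python
# Illustrative sketch only: value-based loss with an LLM-guidance regularizer.
import torch
import torch.nn.functional as F

def regularized_value_loss(q_net, target_q_net, batch, llm_action_probs,
                           gamma=0.99, lam=0.1):
    s, a, r, s_next, done = batch  # one batch of transitions (tensors)

    # Standard TD target and Huber TD loss.
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values
    td_loss = F.smooth_l1_loss(q, target)

    # Regularizer: keep the softmax policy induced by Q close to the action
    # distribution suggested by the LLM for these states (assumed given).
    log_pi = F.log_softmax(q_net(s), dim=1)
    kl = F.kl_div(log_pi, llm_action_probs, reduction="batchmean")

    return td_loss + lam * kl
```

In this reading, the LLM never acts in the environment; it only biases which actions the value learner prefers, which is consistent with the abstract's claim of reduced data requirements.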
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.