VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
- URL: http://arxiv.org/abs/2504.08837v1
- Date: Thu, 10 Apr 2025 17:41:56 GMT
- Title: VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
- Authors: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen
- Abstract summary: GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning. Our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively.
- Score: 55.97950660659051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.
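The abstract names two concrete mechanisms: Selective Sample Replay (SSR), which counters the vanishing-advantages problem in GRPO when every rollout in a group earns the same reward, and Forced Rethinking, which appends a textual trigger to an initial rollout to enforce a self-reflection step. Below is a minimal Python sketch of both ideas; the buffer policy, trigger wording, and the generate/verify helpers are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of SSR and Forced Rethinking as described
# in the abstract. Names, thresholds, and the trigger text are illustrative assumptions.
import random
from dataclasses import dataclass, field
from statistics import mean, pstdev

RETHINK_TRIGGER = "\nWait, let me re-examine the image and verify my answer."  # assumed wording

@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float
    advantage: float = 0.0

def group_advantages(rollouts, eps=1e-6):
    """GRPO-style advantage: each reward normalized within its group of rollouts for one prompt."""
    rewards = [r.reward for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    for r in rollouts:
        r.advantage = (r.reward - mu) / (sigma + eps)
    return rollouts

@dataclass
class SelectiveReplayBuffer:
    """Selective Sample Replay (sketch): keep only rollouts with non-zero advantage,
    so training batches are not dominated by groups where every rollout got the same reward."""
    capacity: int = 1024
    items: list = field(default_factory=list)

    def add(self, rollouts):
        self.items.extend(r for r in rollouts if abs(r.advantage) > 1e-8)
        self.items = self.items[-self.capacity:]  # drop oldest samples beyond capacity

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

def forced_rethinking_rollout(prompt, generate, verify):
    """Forced Rethinking (sketch): append a rethinking trigger to the initial answer,
    let the model continue, and score the full trajectory with a verifiable reward."""
    first_pass = generate(prompt)
    continuation = generate(prompt + first_pass + RETHINK_TRIGGER)
    full_response = first_pass + RETHINK_TRIGGER + continuation
    return Rollout(prompt, full_response, reward=float(verify(full_response)))
```

In a GRPO update, freshly generated rollouts with non-zero advantage could be mixed with samples drawn from the buffer; the actual mixing ratio and trigger phrasing used in the paper are not specified here.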
Related papers
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning.
After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models.
Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z) - SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs).
We show that SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models.
We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z) - OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs). We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z) - Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [8.665713419757061]
We investigate the thinking process in rule-based reinforcement learning fine-tuning (RFT) for multi-modal large language models (MLLMs).
We first propose CLS-RL for classification, using verifiable rewards to encourage MLLM thinking.
Experiments show CLS-RL significantly outperforms SFT and yields a 'free-lunch' generalization effect (improving performance on unseen datasets after training on one dataset).
We then question whether this explicit thinking is always necessary for RFT. Challenging the convention that explicit thinking is crucial for RFT, we introduce No-Thinking-RL, which minimizes thinking via a simple equality accuracy reward.
arXiv Detail & Related papers (2025-03-20T14:37:45Z) - GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z) - Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models [42.70951894754312]
The integration of slow-thinking mechanisms into large language models offers a promising way toward Level 2 AGI Reasoners. We propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference. This mechanism enhances not only reasoning ability but also efficiency, by transforming slow-thinking processes into fast-thinking through self-improvement.
arXiv Detail & Related papers (2025-02-06T08:52:43Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z)