VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
- URL: http://arxiv.org/abs/2504.08837v3
- Date: Thu, 08 May 2025 06:35:06 GMT
- Title: VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
- Authors: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen
- Abstract summary: We aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation). We introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. Our model, VL-Rethinker, advances the state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively.
- Score: 55.97950660659051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing-advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves the open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results demonstrate the effectiveness of our approaches.
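The two techniques named in the abstract are concrete enough to sketch. Below is a minimal, hedged Python sketch of (a) GRPO-style group-normalized advantages with Selective Sample Replay, which keeps and replays only rollouts whose advantage is non-zero, and (b) Forced Rethinking, which appends a rethinking trigger to a first-pass rollout and lets the model continue generating. The buffer policy, the trigger text, and the `generate` interface are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the two VL-Rethinker ingredients described in the abstract.
import random
from statistics import mean, pstdev

REPLAY_BUFFER = []  # stores (prompt, rollout, advantage) triples with non-zero advantage
RETHINK_TRIGGER = " Wait, let me re-examine the image and my reasoning."  # assumed trigger text

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def ssr_step(prompt, rollouts, rewards):
    """Selective Sample Replay: keep rollouts with non-zero advantage, replay buffered ones otherwise.

    When every rollout in a group gets the same reward, all group-normalized
    advantages vanish and the batch carries no learning signal; SSR fills the
    batch from past informative samples instead.
    """
    advs = group_advantages(rewards)
    informative = [(prompt, o, a) for o, a in zip(rollouts, advs) if abs(a) > 1e-6]
    REPLAY_BUFFER.extend(informative)
    if not informative and REPLAY_BUFFER:
        # sample replacements, biased toward large |advantage|
        informative = random.choices(
            REPLAY_BUFFER, weights=[abs(a) for *_, a in REPLAY_BUFFER], k=len(rollouts)
        )
    return informative  # triples fed to the policy-gradient update

def forced_rethinking(generate, prompt):
    """Append a rethinking trigger to a first-pass rollout and let the model continue."""
    first_pass = generate(prompt)
    continuation = generate(prompt + first_pass + RETHINK_TRIGGER)
    return first_pass + RETHINK_TRIGGER + continuation
```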
Related papers
- ProxyThinker: Test-Time Guidance through Small Visual Reasoners [15.901647765066784]
We propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities of small, slow-thinking visual reasoners without any training. By subtracting the output of base models from those of RFT reasoners, ProxyThinker elicits slow-thinking reasoning, demonstrated by emergent behaviors such as self-verification and self-correction. Our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38× faster inference compared to previous decoding-time methods.
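One plausible reading of the subtraction described above is proxy-tuning-style logit arithmetic at decode time: the delta between the small RFT reasoner and its small base model is added to the large base model's logits. The combination rule, the scaling factor `alpha`, and the function names below are assumptions; only the subtraction idea comes from the summary.

```python
# Hedged sketch of a logit-arithmetic reading of ProxyThinker's test-time guidance.
import numpy as np

def proxythinker_logits(large_base_logits, small_rft_logits, small_base_logits, alpha=1.0):
    """Steer the large base model with the delta learned by the small RFT reasoner."""
    delta = small_rft_logits - small_base_logits   # what RL fine-tuning changed in the small model
    return large_base_logits + alpha * delta       # guide the large model without any training

# toy next-token distribution over a 5-token vocabulary
rng = np.random.default_rng(0)
logits = proxythinker_logits(rng.normal(size=5), rng.normal(size=5), rng.normal(size=5))
next_token = int(np.argmax(logits))
```

In this reading, the small model pair supplies the direction of change while the large model supplies the capacity, which is why no additional training is needed.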
arXiv Detail & Related papers (2025-05-30T17:59:43Z)
- SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward [9.717022695892137]
We propose SophiaVL-R1 as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Experiments show that SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks.
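A minimal sketch of rewarding the thinking process as described above, assuming the learned thinking reward model scores a trace in [0, 1] and is simply mixed with a rule-based outcome reward; the interface and weighting below are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: outcome reward plus a learned score for the whole thinking trace.
def combined_reward(outcome_correct: bool,
                    thinking_text: str,
                    thinking_reward_model,
                    weight: float = 0.5) -> float:
    """Rule-based outcome reward augmented with a thinking-quality reward."""
    outcome_r = 1.0 if outcome_correct else 0.0
    thinking_r = thinking_reward_model(thinking_text)  # assumed to return a score in [0, 1]
    return outcome_r + weight * thinking_r
```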
arXiv Detail & Related papers (2025-05-22T17:59:53Z)
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [45.33952788910874]
TON is a two-stage training strategy for vision-language models. It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [69.14405906946634]
We introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
arXiv Detail & Related papers (2025-05-15T14:05:15Z)
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning.
After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models.
Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
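A hedged sketch of the collaborative split described above: the large VLM only perceives (turning frames into textual observations) while the small LM does the spatial reasoning. The function names and prompt format are assumptions, not the paper's implementation.

```python
# Hedged sketch of a perception/reasoning split between a large VLM and a small LM.
def embodied_r_answer(frames, question, vlm_describe, small_lm_reason):
    """Perceive with the large VLM, then reason with the small LM."""
    observations = [vlm_describe(frame) for frame in frames]  # per-frame textual perception
    context = "\n".join(observations)
    prompt = f"Observations:\n{context}\n\nQuestion: {question}\nReason step by step."
    return small_lm_reason(prompt)                            # small (~3B) LM produces the answer
```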
arXiv Detail & Related papers (2025-04-17T06:16:11Z)
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs).
We show that SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models.
We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z)
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs). We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization. OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [8.665713419757061]
We investigate the thinking process in rule-based reinforcement learning fine-tuning (RFT) for multi-modal large language models (MLLMs).
We first propose CLS-RL for classification, using verifiable rewards to encourage MLLM thinking.
Experiments show CLS-RL significantly outperforms SFT and yields a 'free-lunch' generalization effect (improving performance on unseen datasets after training on one dataset).
We then ask whether this explicit thinking is always necessary for RFT. Challenging the convention that explicit thinking is crucial for RFT, we introduce No-Thinking-RL, which minimizes thinking via a simple equality accuracy reward.
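The "simple equality accuracy reward" mentioned above is concrete enough to sketch: a rollout earns reward 1 only when its extracted answer string equals the ground-truth class label. The answer-tag extraction below is an assumed convention, not necessarily the paper's format.

```python
# Hedged sketch of an equality accuracy reward for classification rollouts.
import re

def equality_accuracy_reward(rollout: str, label: str) -> float:
    """1.0 iff the predicted class string equals the ground-truth label."""
    match = re.search(r"<answer>(.*?)</answer>", rollout, flags=re.S)  # assumed answer tags
    prediction = (match.group(1) if match else rollout).strip().lower()
    return 1.0 if prediction == label.strip().lower() else 0.0
```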
arXiv Detail & Related papers (2025-03-20T14:37:45Z)
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
- Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models [42.70951894754312]
The integration of slow-thinking mechanisms into large language models offers a promising path toward Level 2 AGI Reasoners. We propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference. This mechanism enhances not only reasoning ability but also efficiency, by transforming slow-thinking processes into fast thinking through self-improvement.
arXiv Detail & Related papers (2025-02-06T08:52:43Z)
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.