OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
- URL: http://arxiv.org/abs/2503.17352v1
- Date: Fri, 21 Mar 2025 17:52:43 GMT
- Title: OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
- Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang,
- Abstract summary: This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs)<n>We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization.<n>OpenVLThinker, a LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
- Score: 91.88062410741833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved by RL with verifiable rewards and significantly improves model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of the images sourced from diverse visual datasets. Subsequently, iterative RL training further enhance reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, a LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
Related papers
- Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning [50.748123348417984]
This paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning.<n> SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model.<n>Experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA.
arXiv Detail & Related papers (2025-07-22T17:59:34Z) - The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [66.17068546293487]
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning.<n>We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks.<n>We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
arXiv Detail & Related papers (2025-07-10T09:05:49Z) - Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective [82.24301452333577]
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning.<n>A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains.<n>We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains.
arXiv Detail & Related papers (2025-06-17T20:24:00Z) - Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start [24.244577648817188]
"aha moment" patterns are often attributed to emergent properties from reinforcement learning (RL)<n>We present a comprehensive study on enhancing multimodal reasoning through a two-stage approach.<n>Our experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods.
arXiv Detail & Related papers (2025-05-28T13:21:38Z) - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs.
We re-examine this assumption by measuring the pass@textitk metric with large values of textitk to explore the reasoning capability boundary of the models.
We find that the RL does emphnot, in fact, elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z) - Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning.
After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models.
Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z) - d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [31.531278643184656]
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefits from online reinforcement learning (RL)
We propose d1, a framework to adapt pre-trained dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL.
We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.
arXiv Detail & Related papers (2025-04-16T16:08:45Z) - VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [55.97950660659051]
We aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation)<n>We introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step.<n>Our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse to achieve 80.4%, 63.5% respectively.
arXiv Detail & Related papers (2025-04-10T17:41:56Z) - SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs)
We show that SFT can significantly undermine subsequent RL by inducing pseudo reasoning paths'' imitated from expert models.
We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z) - VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [29.524164786422368]
Recently, DeepSeek R1 has shown that reinforcement learning can substantially improve the reasoning capabilities of Large Language Models (LLMs)
We investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs)
We develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks.
arXiv Detail & Related papers (2025-04-10T10:05:15Z) - Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [36.34443944082215]
This work introduces a transparent, from-scratch framework forReinforcement learning (RL) in vision-based models (VLMs)
It offers a minimal yet functional four-step pipeline validated across multiple models and datasets.
In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors.
arXiv Detail & Related papers (2025-04-03T13:53:28Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT)
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking [39.48406368755411]
Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs)<n>RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting necessary to enhance the reasoning capability of RLLMs?"<n>We present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-25T12:37:22Z) - RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning [11.872929831119661]
We introduce RAG-RL, the first reasoning language model (RLM) specifically trained for retrieval-augmented generation (RAG) settings.<n>RAG-RL demonstrates that stronger answer generation models can identify relevant contexts within larger sets of retrieved information.<n>We show that curriculum design in the reinforcement learning (RL) post-training process is a powerful approach to enhancing model performance.
arXiv Detail & Related papers (2025-03-17T02:53:42Z) - GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs)<n>This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld.<n>We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z) - SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [14.786100203787194]
Large language models demonstrate exceptional performance in simple code generation tasks but face challenges in tackling complex problems.
We propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths.
Our method operates entirely through the model itself without requiring additional supervision.
arXiv Detail & Related papers (2024-11-17T12:31:04Z) - On the Modeling Capabilities of Large Language Models for Sequential Decision Making [52.128546842746246]
Large pretrained models are showing increasingly better performance in reasoning and planning tasks.
We evaluate their ability to produce decision-making policies, either directly, by generating actions, or indirectly.
In environments with unfamiliar dynamics, we explore how fine-tuning LLMs with synthetic data can significantly improve their reward modeling capabilities.
arXiv Detail & Related papers (2024-10-08T03:12:57Z) - RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
arXiv Detail & Related papers (2024-02-06T04:06:06Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL [90.06845886194235]
We propose a modified objective for model-based reinforcement learning (RL)
We integrate a term inspired by variational empowerment into a state-space model based on mutual information.
We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds.
arXiv Detail & Related papers (2022-04-18T23:09:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.