GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
- URL: http://arxiv.org/abs/2503.08525v1
- Date: Tue, 11 Mar 2025 15:17:02 GMT
- Title: GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
- Authors: Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
- Abstract summary: Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates its efficacy for VLM agents through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
- Score: 62.536191233049614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet, its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates than SoTA models, despite its notably smaller model size.
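As a rough illustration of the loop the abstract describes, here is a minimal sketch of one GTR-style training step; the `agent`, `corrector`, and `env` interfaces are assumptions for exposition, not the paper's actual API.

```python
# Illustrative sketch of a GTR-style step (interfaces assumed, not the paper's API).
from dataclasses import dataclass

@dataclass
class Transition:
    observation: str
    thought: str      # agent's raw chain-of-thought
    corrected: str    # corrector-refined reasoning (process guidance)
    action: str
    reward: float

def gtr_step(agent, corrector, env, observation):
    """One guided-thought RL step: act, refine the thought, update on both signals."""
    thought, action = agent.act(observation)            # CoT + action from the VLM
    corrected = corrector.refine(observation, thought)  # automated thought correction
    next_obs, reward, done = env.step(action)           # verifiable outcome reward
    # Train reasoning (toward the corrected thought) and action (toward reward)
    # simultaneously, with no per-step human labels.
    agent.update(observation, corrected, action, reward)
    return Transition(observation, thought, corrected, action, reward), next_obs, done
```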
Related papers
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.65034908728828]
Training large language models (LLMs) as interactive agents presents unique challenges.
While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored.
We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z) - Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [36.34443944082215]
This work introduces a transparent, from-scratch framework for reinforcement learning (RL) in vision-language models (VLMs).
It offers a minimal yet functional four-step pipeline validated across multiple models and datasets.
In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors.
arXiv Detail & Related papers (2025-04-03T13:53:28Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT).
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs).
We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization.
OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z) - LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning [16.093659272414527]
We introduce Language Models for Out-of-Distribution Recovery (LaMOuR), which enables recovery learning without relying on uncertainty estimation.
LaMOuR generates dense reward codes that guide the agent back to a state where it can successfully perform its original task.
Experimental results show that LaMOuR substantially enhances recovery efficiency across diverse locomotion tasks.
arXiv Detail & Related papers (2025-03-21T13:20:39Z) - Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [8.665713419757061]
We investigate the thinking process in rule-based reinforcement learning fine-tuning (RFT) for multi-modal large language models (MLLMs).
We first propose CLS-RL for classification, using verifiable rewards to encourage MLLM thinking.
Experiments show CLS-RL significantly outperforms SFT and yields a 'free-lunch' generalization effect (improving performance on unseen datasets after training on one dataset).
We then question whether such explicit thinking is always necessary for RFT. Challenging the convention that explicit thinking is crucial, we introduce No-Thinking-RL, which minimizes thinking via a simple equality accuracy reward.
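The "simple equality accuracy reward" plausibly reduces to a binary verifiable check; a minimal sketch (the whitespace/case normalization is an assumption, not from the paper):

```python
def equality_accuracy_reward(prediction: str, label: str) -> float:
    # Binary verifiable reward: 1.0 iff the model's answer matches the label.
    # Normalization is an assumption added here for robustness.
    return 1.0 if prediction.strip().lower() == label.strip().lower() else 0.0
```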
arXiv Detail & Related papers (2025-03-20T14:37:45Z) - ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [54.787341008881036]
We introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors.
ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions.
Experimental results demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
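A hedged sketch of the two-level decomposition described above; the `meta_agent` and `reasoning_agent` interfaces are hypothetical, and in ReMA both levels would be trained jointly with MARL.

```python
def rema_rollout(meta_agent, reasoning_agent, question: str) -> str:
    # High-level agent produces strategic oversight and a plan;
    # low-level agent carries out the detailed reasoning under that plan.
    plan = meta_agent.plan(question)
    return reasoning_agent.solve(question, plan)
```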
arXiv Detail & Related papers (2025-03-12T16:05:31Z) - InDRiVE: Intrinsic Disagreement based Reinforcement for Vehicle Exploration through Curiosity Driven Generalized World Model [0.0]
In this paper, we propose InDRiVE (Intrinsic Disagreement based Reinforcement for Vehicle Exploration), a model-based reinforcement learning framework. By training an ensemble of world models, the agent actively explores high-uncertainty regions of environments without task-specific feedback. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions than DreamerV2 and DreamerV3 baselines.
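The summary does not spell out how disagreement is computed; a common formulation (Plan2Explore-style) uses the variance of an ensemble's next-state predictions as the intrinsic reward, sketched here under that assumption:

```python
import numpy as np

def disagreement_reward(ensemble_predictions: np.ndarray) -> float:
    # ensemble_predictions: shape (n_models, latent_dim), each row one world
    # model's prediction of the next latent state. Variance across models,
    # averaged over dimensions, measures epistemic uncertainty: high variance
    # marks under-explored regions worth visiting.
    return float(np.var(ensemble_predictions, axis=0).mean())
```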
arXiv Detail & Related papers (2025-03-07T16:56:00Z) - Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs [58.18140409409302]
Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL). Applying RL in broader domains like chatbots and content generation presents unique challenges. We show a case study of reproducing existing reward model ensemble research using embedding-based reward models.
arXiv Detail & Related papers (2025-02-04T19:37:35Z) - RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
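A hedged sketch of the feedback step such a method implies: query a VLM for a pairwise preference between two observations, then fit a reward model on the collected labels (the `vlm.query` interface is hypothetical).

```python
def preference_label(vlm, task_description: str, image_a, image_b) -> int:
    # Ask the VLM which observation shows more task progress; the returned
    # 0/1 labels can train a reward model (e.g., via a Bradley-Terry loss).
    answer = vlm.query(
        f"Task: {task_description}. Which image shows more progress, A or B?",
        images=[image_a, image_b],
    )
    return 0 if "A" in answer else 1
```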
arXiv Detail & Related papers (2024-02-06T04:06:06Z) - Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
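One way to read the token-level objective: tokens the generative reward model preserves in its minimal rewrite are rewarded, and edited tokens are not. This is an illustrative reduction, not the paper's actual objective:

```python
import difflib

def token_level_rewards(output_tokens: list[str], rewrite_tokens: list[str]) -> list[float]:
    # Tokens kept verbatim in the reward model's rewrite score 1.0; tokens the
    # rewrite edits away score 0.0. The real RLMEC objective is more involved.
    rewards = [0.0] * len(output_tokens)
    matcher = difflib.SequenceMatcher(a=output_tokens, b=rewrite_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            rewards[i] = 1.0
    return rewards
```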
arXiv Detail & Related papers (2024-01-11T17:58:41Z) - Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning [69.19840497497503]
It is argued that the commonly used action-matching principle explains the deep neural network (DNN) rather than interpreting the RL agent itself.
We propose instead to center interpretation on rewards, the essential objective of RL agents.
We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment.
arXiv Detail & Related papers (2023-09-04T09:09:54Z) - Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method that uses unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z) - Residual Reinforcement Learning from Demonstrations [51.56457466788513]
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.
We extend the residual formulation to learn from visual inputs and sparse rewards using demonstrations.
Our experimental evaluation on simulated manipulation tasks on a 6-DoF UR5 arm and a 28-DoF dexterous hand demonstrates that residual RL from demonstrations is able to generalize to unseen environment conditions more flexibly than either behavioral cloning or RL fine-tuning.
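The residual formulation has a compact core: the executed action is the conventional feedback controller's output plus a learned correction. A minimal sketch (function names are illustrative):

```python
import numpy as np

def residual_action(base_controller, residual_policy, observation) -> np.ndarray:
    # Executed action = hand-designed feedback controller + learned residual.
    # RL only has to learn the correction, which eases exploration under
    # sparse rewards and visual inputs.
    return base_controller(observation) + residual_policy(observation)
```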
arXiv Detail & Related papers (2021-06-15T11:16:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.