SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
- URL: http://arxiv.org/abs/2505.17018v1
- Date: Thu, 22 May 2025 17:59:53 GMT
- Title: SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
- Authors: Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue
- Abstract summary: We propose SophiaVL-R1 as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks.
- Score: 9.717022695892137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1 as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
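The two training-time mechanisms described in the abstract, a per-sample trustworthiness weight obtained by comparing the thinking rewards of rollouts that reach correct versus incorrect answers, and an annealing factor that gradually shifts weight back to the rule-based outcome reward, can be pictured with a short sketch. The snippet below is a minimal, hypothetical illustration, not the authors' implementation: the function names, the clamping of the reward gap into [0, 1], and the linear annealing schedule are all assumptions.

```python
# Illustrative sketch (not the paper's code): combine an outcome reward with a
# trust-weighted, annealed thinking reward for a GRPO-style group of rollouts.
# All names and the specific weighting choices below are hypothetical.

from typing import List

def trust_weight(thinking_rewards: List[float], correct: List[bool]) -> float:
    """Heuristic trust in the thinking reward for one group of rollouts.

    If rollouts that reach the correct answer also receive higher thinking
    rewards than incorrect ones, the thinking reward model is likely reliable
    for this sample; otherwise it gets down-weighted.
    """
    pos = [r for r, c in zip(thinking_rewards, correct) if c]
    neg = [r for r, c in zip(thinking_rewards, correct) if not c]
    if not pos or not neg:                    # cannot compare -> neutral trust
        return 1.0
    gap = sum(pos) / len(pos) - sum(neg) / len(neg)
    return max(0.0, min(1.0, 0.5 + gap))      # squash the gap into [0, 1]

def anneal(step: int, total_steps: int) -> float:
    """Linearly decay the thinking-reward coefficient over training."""
    return max(0.0, 1.0 - step / total_steps)

def combined_rewards(outcome: List[float], thinking: List[float],
                     correct: List[bool], step: int, total_steps: int) -> List[float]:
    """Per-rollout reward = outcome reward + trust * anneal * thinking reward."""
    w = trust_weight(thinking, correct) * anneal(step, total_steps)
    return [o + w * t for o, t in zip(outcome, thinking)]
```

In a GRPO-style setup, rewards combined this way would then be normalized within each rollout group to form advantages; the exact Trust-GRPO formulation is given in the paper.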
Related papers
- Residual Reward Models for Preference-based Reinforcement Learning [11.797520525358564]
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify. However, PbRL can suffer from slow convergence since it requires training a reward model. We propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM).
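As a rough picture of what a residual decomposition could look like, the sketch below composes a prior reward with a learned residual correction. The decomposition and the names are assumptions inferred from the term "Residual Reward Model", not the paper's formulation.

```python
# Hypothetical sketch: keep a prior (hand-specified or pretrained) reward and
# learn only a residual correction from preference data.

from typing import Callable

def residual_reward(prior_reward: Callable[[object], float],
                    learned_residual: Callable[[object], float]) -> Callable[[object], float]:
    """Return a reward function r(s) = r_prior(s) + r_residual(s)."""
    return lambda state: prior_reward(state) + learned_residual(state)
```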
arXiv Detail & Related papers (2025-07-01T09:43:57Z)
- Generalist Reward Models: Found Inside Large Language Models [50.7432354447554]
We show that a powerful reward model is already latently present within any Large Language Model (LLM) trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. We also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model.
arXiv Detail & Related papers (2025-06-29T13:45:54Z)
- Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs). We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model. We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO).
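The claim that a response-level reward suffices for unbiased policy-gradient estimation echoes a standard likelihood-ratio identity; the block below states that identity under the assumption that the true reward decomposes over tokens. It is a sketch in the spirit of the summary, not the paper's exact theorem or the TRePO estimator.

```latex
% Standard REINFORCE identity (assumption: the true reward decomposes over
% tokens as R(x,y) = \sum_t r_t(x, y_{\le t}), with the individual r_t unknown).
\begin{align*}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[\sum_{t=1}^{|y|} r_t(x, y_{\le t})\Big]
           = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big], \\
\nabla_\theta J(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[R(x, y)
           \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\Big],
\end{align*}
% i.e. an unbiased gradient estimate needs only the response-level value R(x, y).
```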
arXiv Detail & Related papers (2025-06-03T07:44:31Z)
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [45.33952788910874]
TON is a two-stage training strategy for vision-language models. It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance on average across three reward model benchmarks.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [55.97950660659051]
We aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning, without relying on distillation. We introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. Our model, VL-Rethinker, advances state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively.
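The Forced Rethinking step described above is simple to sketch: append a rethinking trigger to a finished rollout and let the model continue, producing an explicit self-reflection segment. The trigger text and the generate() interface below are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch of a forced self-reflection continuation.

RETHINK_TRIGGER = "\nWait, let me re-check my reasoning."  # illustrative trigger

def forced_rethinking_rollout(model, prompt: str, max_new_tokens: int = 512) -> str:
    """Sample a rollout, then force a reflection segment after a trigger."""
    first_pass = model.generate(prompt, max_new_tokens=max_new_tokens)
    reflection = model.generate(prompt + first_pass + RETHINK_TRIGGER,
                                max_new_tokens=max_new_tokens)
    return first_pass + RETHINK_TRIGGER + reflection
```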
arXiv Detail & Related papers (2025-04-10T17:41:56Z)
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates whether the same holds for vision-language model (VLM) agents through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n search on real-world downstream tasks.
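A minimal sketch of the combination described above: blend a learned reward model's scalar score with verifiable correctness signals from one or more checkers. The blending weight and the verifier interface are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical blend of an RM score with verifiable correctness signals.

from typing import Callable, Sequence

def agentic_reward(response: str,
                   rm_score: float,
                   verifiers: Sequence[Callable[[str], float]],
                   alpha: float = 0.5) -> float:
    """Blend the scalar RM score with the average verifier score in [0, 1]."""
    if not verifiers:
        return rm_score
    verifier_score = sum(v(response) for v in verifiers) / len(verifiers)
    return alpha * rm_score + (1.0 - alpha) * verifier_score
```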
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
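The redistribution idea above can be sketched as follows: spread a scalar, response-level reward over completion tokens in proportion to attention weights taken from the reward model. This is an illustrative sketch under that assumption, not the paper's implementation; the normalization choice is a guess.

```python
# Hypothetical attention-based redistribution of a scalar reward.

import numpy as np

def redistribute_reward(scalar_reward: float,
                        attention_weights: np.ndarray) -> np.ndarray:
    """Spread a response-level reward over tokens.

    `attention_weights` is assumed to be a 1-D array of non-negative weights
    (e.g. attention from the reward model's scoring position to each completion
    token). The per-token rewards sum back to the original scalar.
    """
    w = np.clip(attention_weights, 0.0, None)
    if w.sum() == 0.0:
        w = np.ones_like(w)
    return scalar_reward * w / w.sum()

# Example: a reward of 1.0 spread over a 4-token completion.
dense = redistribute_reward(1.0, np.array([0.1, 0.4, 0.3, 0.2]))
```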
arXiv Detail & Related papers (2024-02-01T17:10:35Z)