Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2505.03318v3
- Date: Wed, 29 Oct 2025 04:02:02 GMT
- Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
- Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
- Abstract summary: This paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model. We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o. We then prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks.
- Score: 31.727984223052648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks; during this phase, correct reasoning outputs are retained for rejection sampling to refine the model. (3) Finally, incorrectly predicted samples are used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
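To make stage (3) concrete, the following is a minimal Python sketch of a GRPO-style update with a binary verifiable reward. The `policy` interface (`sample`, `logprob_ratio`) and `reward_fn` are hypothetical placeholders rather than the paper's released code, and the per-sequence ratio is a simplification: GRPO as commonly implemented uses per-token ratios plus a KL penalty toward a reference model.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward
    against the mean/std of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def grpo_step(prompt, policy, reward_fn, group_size=8, clip_eps=0.2):
    # 1. Sample a group of CoT reasoning paths for the same preference prompt.
    rollouts = [policy.sample(prompt) for _ in range(group_size)]
    # 2. Score each path with the verifiable reward (e.g., 1 if the final
    #    preference judgment matches the ground-truth label, else 0).
    rewards = [reward_fn(prompt, out) for out in rollouts]
    # 3. Convert raw rewards into group-relative advantages.
    advs = grpo_advantages(rewards)
    # 4. PPO-style clipped policy-gradient loss, averaged over the group.
    loss = 0.0
    for out, adv in zip(rollouts, advs):
        ratio = policy.logprob_ratio(prompt, out)  # pi_theta / pi_old
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        loss -= min(ratio * adv, clipped * adv)
    return loss / group_size

# Toy check of the advantage normalization: two correct, two incorrect paths.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[1.0, -1.0, -1.0, 1.0]
```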
Related papers
- DiffuReason: Bridging Latent Reasoning and Generative Refinement for Sequential Recommendation [20.756497463882763]
We propose DiffuReason, a unified "Think-then-Diffuse" framework for sequential recommendation. It integrates multi-step Thinking Tokens for latent reasoning, diffusion-based refinement for denoising intermediate representations, and end-to-end Group Relative Policy Optimization. Experiments on four benchmarks demonstrate that DiffuReason consistently improves diverse backbone architectures.
arXiv Detail & Related papers (2026-02-10T12:55:30Z)
- Unified Personalized Reward Model for Vision Generation [27.496220369122494]
We propose UnifiedReward-Flex, a unified personalized reward model for vision generation. We first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT. We then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment.
arXiv Detail & Related papers (2026-02-02T17:44:21Z)
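Since this summary hinges on direct preference optimization, a one-pair sketch of the generic DPO loss may help; the numbers below are illustrative, and nothing here is specific to UnifiedReward-Flex.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]),
    with log-probs assumed summed over response tokens."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy usage: the policy favors the chosen response more than the frozen
# reference model does, so the loss dips below log(2) ~= 0.693.
print(dpo_loss(-12.0, -15.0, -13.0, -14.5))  # ~0.62
```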
- Discovering Process-Outcome Credit in Multi-Step LLM Reasoning [3.584086358722852]
Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs). We propose a novel framework designed to provide continuous reward signals. Our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.
arXiv Detail & Related papers (2026-02-01T05:44:09Z)
- Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier reason over retrieved evidence and critique each other's logic while being guided by a process-aware advantage. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
- Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection [64.34737012956182]
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Existing multimodal datasets and CoT methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines. We propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks.
arXiv Detail & Related papers (2025-12-22T02:07:20Z)
- The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging [8.930191971732649]
We present a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles.
arXiv Detail & Related papers (2025-09-26T08:12:13Z)
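The simplest knob in this tunable space can be sketched directly: linear interpolation between a base checkpoint and a long-reasoning checkpoint. The single-coefficient scheme below is only one of the merging techniques such studies consider, and the dictionaries stand in for real model state dicts.

```python
def interpolate_weights(base_state, reasoning_state, alpha):
    """Return (1 - alpha) * base + alpha * reasoning, parameter by parameter.
    Larger alpha trades token efficiency for reasoning accuracy."""
    return {
        name: (1.0 - alpha) * base_state[name] + alpha * reasoning_state[name]
        for name in base_state
    }

# Toy usage with scalar "parameters":
print(interpolate_weights({"w": 1.0}, {"w": 3.0}, alpha=0.25))  # {'w': 1.5}
```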
- Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models [49.598776427454176]
Large Reasoning Models (LRMs) have gradually become a research hotspot due to their outstanding performance in handling complex tasks. However, with the widespread application of these models, the problem of overthinking has gradually emerged. Various efficient reasoning methods have been proposed, aiming to reduce the length of reasoning paths without compromising model performance and reasoning capability.
arXiv Detail & Related papers (2025-08-04T06:54:31Z)
- Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning [10.255235456427037]
We propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in Large Language Models (LLMs). The first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization. The second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization.
arXiv Detail & Related papers (2025-05-27T13:29:51Z)
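A length-aware reward of the kind the second stage uses can be sketched as follows; the linear decay past a token budget is an assumed shaping for illustration, not the paper's exact formula.

```python
def length_aware_reward(is_correct, num_tokens, budget=512, max_tokens=2048):
    """Correct answers keep full reward only while the CoT stays short."""
    if not is_correct:
        return 0.0  # conciseness never outranks correctness
    if num_tokens <= budget:
        return 1.0
    # Linearly decay the reward as the CoT overshoots the budget.
    overshoot = min(num_tokens, max_tokens) - budget
    return max(0.0, 1.0 - overshoot / (max_tokens - budget))

print(length_aware_reward(True, 400))   # 1.0
print(length_aware_reward(True, 1280))  # 0.5
print(length_aware_reward(False, 100))  # 0.0
```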
- Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals [45.019257216564036]
This paper investigates extended inductive reasoning in large language models (LLMs). We propose AlignXplore, a model that enables systematic preference inference from behavioral signals in users' interaction histories. We show that AlignXplore achieves substantial improvements over the backbone model by an average of 15.49% on in-domain and out-of-domain benchmarks.
arXiv Detail & Related papers (2025-05-23T16:16:46Z)
- LARES: Latent Reasoning for Sequential Recommendation [96.26996622771593]
We present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity. Our framework exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.
arXiv Detail & Related papers (2025-05-22T16:22:54Z)
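The recurrent idea, reapplying one parameter-shared block so reasoning depth grows without new parameters, is easy to sketch; the toy block below is a stand-in, not the paper's architecture.

```python
def latent_reasoning(hidden, shared_block, depth):
    """Refine a latent state by reapplying the same block `depth` times."""
    for _ in range(depth):
        hidden = shared_block(hidden)  # identical weights at every step
    return hidden

# Toy usage: a contraction that converges toward a fixed point of 2.0.
refine = lambda h: 0.5 * h + 1.0
print(latent_reasoning(4.0, refine, depth=3))  # 2.25
```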
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
We introduce a new class of generative reward models -- Reward Reasoning Models (ReasRMs). We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple benchmarks.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives [14.401557416713315]
We revisit the foundations of using Bradley-Terry (BT) models in reward modeling. We argue that the BT model is not a necessary choice from the perspective of downstream optimization. We propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers.
arXiv Detail & Related papers (2024-11-07T18:57:03Z)
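For reference, the Bradley-Terry likelihood being revisited reduces to a logistic function of the reward gap; the reward values below are illustrative.

```python
import math

def bt_preference_prob(reward_a, reward_b):
    """P(A preferred over B) = sigmoid(r_A - r_B) under the BT model."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# A reward gap of 2.2 already implies a ~90% preference probability;
# only reward *differences* are identifiable under BT.
print(bt_preference_prob(1.1, -1.1))  # ~0.90
```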
- Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training [49.3242278912771]
Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions.
Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework.
We propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process.
arXiv Detail & Related papers (2023-11-23T17:09:48Z)
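The selection step of such self-consistency training is simple enough to sketch: sample several rationale/answer pairs and keep the majority answer. The sampling interface is hypothetical; only the voting logic is shown.

```python
from collections import Counter

def self_consistency_answer(samples):
    """samples: list of (rationale, answer) pairs; returns the modal answer."""
    votes = Counter(answer for _, answer in samples)
    return votes.most_common(1)[0][0]

# Toy usage: three of four sampled rationales agree on "B".
samples = [("r1", "B"), ("r2", "A"), ("r3", "B"), ("r4", "B")]
print(self_consistency_answer(samples))  # B
```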
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
A Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar performance improvements on code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
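The greedy PRM-guided search described above can be sketched as follows; `propose_steps` and `prm_score` are hypothetical stand-ins for the step generator and the step-level reward model.

```python
def prm_greedy_search(question, propose_steps, prm_score,
                      max_steps=8, candidates=4):
    """At each step, keep the candidate the PRM scores highest so far."""
    path = []
    for _ in range(max_steps):
        options = propose_steps(question, path, n=candidates)
        if not options:
            break
        best = max(options, key=lambda step: prm_score(question, path + [step]))
        path.append(best)
    return path
```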
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.