Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards
- URL: http://arxiv.org/abs/2509.19003v1
- Date: Tue, 23 Sep 2025 13:47:32 GMT
- Title: Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards
- Authors: Honghao Chen, Xingzhou Lou, Xiaokun Feng, Kaiqi Huang, Xinlong Wang,
- Abstract summary: We present chain of step reasoning for vision-language models, enabling assessing reasoning step quality accurately.<n>We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training.<n>We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning.
- Score: 48.55501117313608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggles to perform fine-grained structured reasoning and, more importantly, are difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling assessing reasoning step quality accurately and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.
Related papers
- Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation [46.38008143057758]
Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable.<n>This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment.<n>Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges.
arXiv Detail & Related papers (2026-02-10T00:45:24Z) - See, Think, Learn: A Self-Taught Multimodal Reasoner [3.443084677278651]
We propose a simple yet effective self-training framework called See-Think-Learn.<n>At its core, STL introduces a structured reasoning template that encourages the model to see before thinking.<n>We augment the training data with negative rationales to enhance the model's ability to distinguish between correct and misleading responses.
arXiv Detail & Related papers (2025-12-02T06:30:10Z) - CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding [13.628041236679229]
Vision Language Models (VLMs) have recently shown significant advancements in video understanding, but their capability for counterfactual reasoning remains underexplored.<n>We introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning.<n>We develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality.
arXiv Detail & Related papers (2025-11-25T04:59:55Z) - Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs)<n>We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference.<n>We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z) - Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations.<n>The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z) - Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning [41.255832127671205]
We propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of large audio language models (LALMs)<n>Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task dynamically.<n> Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks.
arXiv Detail & Related papers (2025-08-11T14:41:10Z) - Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique [66.94905631175209]
We propose a novel inference-time scaling approach -- stepwise natural language self-critique (PANEL)<n>It employs self-generated natural language critiques as feedback to guide the step-level search process.<n>This approach bypasses the need for task-specific verifiers and the associated training overhead.
arXiv Detail & Related papers (2025-03-21T17:59:55Z) - OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning.<n>We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.<n>Models may behave unreliably due to poorly explored failure modes.<n> causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.