Latent Chain-of-Thought for Visual Reasoning
- URL: http://arxiv.org/abs/2510.23925v2
- Date: Wed, 29 Oct 2025 18:48:20 GMT
- Title: Latent Chain-of-Thought for Visual Reasoning
- Authors: Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
- Abstract summary: Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks.
- Score: 53.541579327424046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and rely heavily on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. Leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function that provides token-level learning signals encouraging diverse, high-likelihood latent CoT, overcoming the limitations of deterministic sampling and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with marginal-likelihood ranking to efficiently select optimal rationales and answers. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks in terms of effectiveness, generalization, and interpretability.
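To make the abstract's formulation concrete, here is a minimal sketch of the latent-variable reading of CoT it describes. The notation (x for the multimodal input, z for the latent chain-of-thought, y for the answer) is assumed for illustration and is not taken from the paper:

```latex
% Assumed notation: x = image + question, z = latent CoT, y = final answer.
% The rationale is treated as a latent variable, so the answer likelihood
% marginalizes over it:
\[
  p(y \mid x) \;=\; \sum_{z} p(z \mid x)\, p(y \mid x, z).
\]
% The posterior over rationales, p(z | x, y), is intractable, which is where
% amortized variational inference comes in: an amortized sampler q_\phi(z | x)
% is trained to approximate it, e.g. by maximizing an evidence lower bound:
\[
  \log p(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\bigl[\log p(y \mid x, z)\bigr]
  \;-\; \mathrm{KL}\bigl(q_\phi(z \mid x) \,\|\, p(z \mid x)\bigr).
\]
```

The same marginal likelihood is what the abstract's inference-scaling strategy ranks by in place of Best-of-N or Beam Search. A rough Monte Carlo sketch of that ranking step follows; `model.sample_rationale`, `model.answer`, and `model.log_prob` are hypothetical placeholder interfaces, not the paper's actual API:

```python
import math

def rank_by_marginal_likelihood(model, x, num_samples=8):
    """Sketch of the Bayesian inference-scaling idea from the abstract:
    rank candidate answers by an estimate of the marginal likelihood
    p(y | x) = E_{z ~ p(z | x)}[ p(y | x, z) ], marginalizing over sampled
    latent rationales z, instead of reranking with a learned reward model
    (Best-of-N) or running Beam Search.

    model.sample_rationale, model.answer, and model.log_prob are
    hypothetical placeholders, not the paper's actual interfaces.
    """
    # Sample N latent chains-of-thought and decode one candidate answer each.
    rationales = [model.sample_rationale(x) for _ in range(num_samples)]
    candidates = {model.answer(x, z) for z in rationales}

    def log_marginal(y):
        # Monte Carlo estimate: log p(y|x) ~= log (1/N) sum_i p(y | x, z_i),
        # computed with a log-sum-exp for numerical stability.
        log_liks = [model.log_prob(y, given=(x, z)) for z in rationales]
        m = max(log_liks)
        return (m + math.log(sum(math.exp(v - m) for v in log_liks))
                - math.log(num_samples))

    # The answer with the highest estimated marginal likelihood wins.
    return max(candidates, key=log_marginal)
```

The design appeal, as the abstract frames it, is that this ranking needs no external reward model, which removes the reward-hacking surface that reward-based Best-of-N reranking exposes.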
Related papers
- Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization [52.74762030521324]
We propose a novel algorithm to learn reward functions from observed actions. We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm.
arXiv Detail & Related papers (2026-01-19T04:12:51Z)
- Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning [30.889495810312624]
We propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances reasoning.
arXiv Detail & Related papers (2025-12-04T01:09:17Z)
- Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization [22.301471821413816]
Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs). We propose a novel approach for eliciting CoT reasoning from base LLMs through hidden state manipulation grounded in conditional generation.
arXiv Detail & Related papers (2025-11-24T13:55:57Z)
- VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision [9.028503801131933]
We introduce Variance-Controlled Optimization-based REweighting (VCORE). By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods.
arXiv Detail & Related papers (2025-10-31T13:19:24Z)
- Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
- COPO: Consistency-Aware Policy Optimization [17.328515578426227]
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. We propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency.
arXiv Detail & Related papers (2025-08-06T07:05:18Z)
- Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- RaCT: Ranking-aware Chain-of-Thought Optimization for LLMs [30.216174551427443]
Large language models (LLMs) have demonstrated remarkable potential in text reranking tasks. Conventional supervised fine-tuning approaches for specializing LLMs in ranking tasks often lead to significant degradation of the models' general-purpose abilities. This paper presents a novel methodology that strategically combines Chain-of-Thought (CoT) prompting techniques with an innovative two-stage training pipeline.
arXiv Detail & Related papers (2024-12-18T23:24:15Z)
- Rethinking Chain-of-Thought from the Perspective of Self-Training [10.722453877596998]
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. We propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning module that dynamically refines the reasoning process.
arXiv Detail & Related papers (2024-12-14T13:12:50Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Probabilistic Constrained Reinforcement Learning with Formal Interpretability [2.990411348977783]
We propose Adaptive Wasserstein Variational Optimization (AWaVO) to tackle these interpretability challenges.
Our approach uses formal methods to provide interpretability in the form of convergence guarantees, training transparency, and intrinsic decision interpretation.
In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO, and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.
arXiv Detail & Related papers (2023-07-13T22:52:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.