Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning
- URL: http://arxiv.org/abs/2509.23322v1
- Date: Sat, 27 Sep 2025 14:13:41 GMT
- Title: Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning
- Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
- Abstract summary: We introduce a training-free visual-reasoning pipeline for Large Language Models (LLMs). A powerful LLM orchestrates the high-level reasoning, strategically interrogating an LMM to extract specific visual information required for its logical chain. Our framework effectively governs the visual reasoning process, leading to a significant reduction in visually-unfounded reasoning steps and a substantial improvement in reasoning fidelity.
- Score: 34.940968264459805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant advancements in the reasoning capabilities of Large Language Models (LLMs) are now driven by test-time scaling laws, particularly those leveraging extended Chain-of-Thought (CoT) reasoning. Inspired by these breakthroughs, researchers have extended these paradigms to Large Multimodal Models (LMMs). However, a critical limitation emerges: as their reasoning chains extend, LMMs increasingly rely on textual logic, progressively losing grounding in the underlying visual information. This leads to reasoning paths that diverge from the image content, culminating in erroneous conclusions. To address this, we introduce a strikingly simple yet effective training-free visual-reasoning pipeline. The core concept is to decouple the reasoning and perception processes. A powerful LLM orchestrates the high-level reasoning, strategically interrogating an LMM to extract specific visual information required for its logical chain. The LMM, in turn, functions exclusively as a visual question-answering engine, supplying the necessary perceptual details on demand. This lightweight, plug-and-play approach requires no additional training or architectural changes. Comprehensive evaluations validate that our framework effectively governs the visual reasoning process, leading to a significant reduction in visually-unfounded reasoning steps and a substantial improvement in reasoning fidelity.
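The decoupling described in the abstract can be pictured as a simple control loop: a text-only reasoner decides what to ask, and a perception model answers only those visual questions. The sketch below is an illustration under stated assumptions, not the authors' implementation. The function `decoupled_visual_reasoning`, the `ASK:`/`ANSWER:` message convention, the `max_queries` budget, and the toy stand-ins for the LLM and LMM are all hypothetical; the abstract does not specify a protocol or API.

```python
from typing import Callable, List, Optional, Tuple

# A list of (visual question, perceptual answer) pairs gathered so far.
Evidence = List[Tuple[str, str]]


def decoupled_visual_reasoning(
    question: str,
    image: str,
    reasoner: Callable[[str, Evidence], str],
    perceiver: Callable[[str, str], str],
    max_queries: int = 5,
) -> Optional[str]:
    """Run a reason/perceive loop (hypothetical protocol, not from the paper).

    The reasoner never sees the image. It returns either
    "ASK: <visual question>" or "ANSWER: <final answer>".
    The perceiver only answers visual questions about the image.
    """
    evidence: Evidence = []
    for _ in range(max_queries):
        action = reasoner(question, evidence)
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        if action.startswith("ASK:"):
            visual_question = action[len("ASK:"):].strip()
            evidence.append((visual_question, perceiver(image, visual_question)))
        # Any other output is ignored and simply consumes one turn.
    return None  # Query budget exhausted without a final answer.


def toy_reasoner(question: str, evidence: Evidence) -> str:
    """Stand-in for the LLM: asks one visual question, then answers."""
    if not evidence:
        return "ASK: How many apples are on the table?"
    return f"ANSWER: {evidence[-1][1]}"


def toy_perceiver(image: str, visual_question: str) -> str:
    """Stand-in for the LMM used purely as a VQA engine."""
    return "3"


if __name__ == "__main__":
    print(decoupled_visual_reasoning(
        "How many apples are there?", "image.png", toy_reasoner, toy_perceiver
    ))
```

In a real setup the two callables would wrap model calls; keeping them as plain functions makes the separation explicit, since the reasoner's only access to the image is through the evidence list.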
Related papers
- See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs [24.90876091319589]
We present an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench.
arXiv Detail & Related papers (2026-02-25T02:13:59Z)
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- See, Think, Learn: A Self-Taught Multimodal Reasoner [3.443084677278651]
We propose a simple yet effective self-training framework called See-Think-Learn. At its core, STL introduces a structured reasoning template that encourages the model to see before thinking. We augment the training data with negative rationales to enhance the model's ability to distinguish between correct and misleading responses.
arXiv Detail & Related papers (2025-12-02T06:30:10Z)
- From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z)
- Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
- Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs [59.66595230543127]
Conceptual diagrams externalize mental models, abstracting irrelevant details to efficiently capture how entities interact. Large Language Models (LLMs) and Large MultiModal Models (LMMs) predominantly reason through text. We propose Visual Thinking, a generalizable framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z)
- Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
- Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA).
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z)
- ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [59.92786855289658]
We introduce a novel visual reasoning framework named ProReason. ProReason features decoupled vision-reasoning capabilities and multi-run proactive perception. Our experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks.
arXiv Detail & Related papers (2024-10-18T03:22:06Z)
- FiDeLiS: Faithful Reasoning in Large Language Model for Knowledge Graph Question Answering [46.41364317172677]
Large Language Models (LLMs) are often challenged by generating erroneous or hallucinated responses. We propose a unified framework, FiDeLiS, designed to improve the factuality of LLM responses by anchoring answers to verifiable reasoning steps retrieved from Knowledge Graphs. Our method, as a training-free framework, not only improves performance but also enhances factuality and interpretability across different benchmarks.
arXiv Detail & Related papers (2024-05-22T17:56:53Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
- Concise and Organized Perception Facilitates Reasoning in Large Language Models [31.238220405009617]
Exploiting large language models (LLMs) to tackle reasoning has garnered growing attention. It still remains highly challenging to achieve satisfactory results in complex logical problems, characterized by plenty of premises within the context and requiring multi-hop reasoning. In this work, we first examine the mechanism from the perspective of information flow and reveal that LLMs confront difficulties akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.