Look-Back: Implicit Visual Re-focusing in MLLM Reasoning
- URL: http://arxiv.org/abs/2507.03019v1
- Date: Wed, 02 Jul 2025 14:59:35 GMT
- Title: Look-Back: Implicit Visual Re-focusing in MLLM Reasoning
- Authors: Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, Li Yuan
- Abstract summary: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. Current methods typically address their over-reliance on textual information by explicitly injecting visual information to guide the reasoning process. We introduce Look-Back, an implicit approach designed to guide MLLMs to "look back" at visual information in a self-directed manner during reasoning.
- Score: 15.478700750705643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to "look back" at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.
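The observation above comes from tracking how much attention the model's generated tokens place on image tokens as decoding proceeds. Below is a minimal sketch of that kind of measurement, not the authors' analysis code; it assumes you already have per-layer attention weights for one sequence and a boolean mask marking the image-token positions.

```python
import torch

def visual_attention_fraction(attentions: torch.Tensor,
                              image_token_mask: torch.Tensor) -> torch.Tensor:
    """Share of attention mass each query position places on image tokens.

    attentions: (num_layers, num_heads, seq_len, seq_len) attention weights
        for a single sequence, where each row already sums to 1.
    image_token_mask: (seq_len,) boolean tensor, True at image-token positions.
    Returns a (seq_len,) tensor averaged over layers and heads.
    """
    avg_attn = attentions.float().mean(dim=(0, 1))        # (seq_len, seq_len)
    return avg_attn[:, image_token_mask].sum(dim=-1)      # sum over image-token columns

# Toy example: 2 layers, 4 heads, 10 tokens, the first 4 being image tokens.
attn = torch.rand(2, 4, 10, 10)
attn = attn / attn.sum(dim=-1, keepdim=True)              # normalize rows like softmax output
mask = torch.tensor([True] * 4 + [False] * 6)
print(visual_attention_fraction(attn, mask))
```

Under this kind of measurement, a drop in the fraction at later query positions would correspond to the fading visual focus described in the abstract, and a later recovery would correspond to the model "looking back" at the image.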
Related papers
- Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
- Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis [21.869968563545736]
We define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. We introduce a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. We extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios.
arXiv Detail & Related papers (2025-05-15T17:52:40Z)
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning process. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
- VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z)
- Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs [7.03771340666549]
Vision-language misalignment in Multimodal Large Language Models (MLLMs) is a critical challenge. We propose MapleLeaf AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios.
arXiv Detail & Related papers (2025-03-04T13:18:33Z)
- MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs [11.532430076027554]
We study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself.
arXiv Detail & Related papers (2025-02-24T18:54:40Z)
- ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [41.369481426130186]
We introduce a novel visual reasoning framework named ProReason. ProReason features decoupled vision-reasoning capabilities and multi-run proactive perception. Our experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks.
arXiv Detail & Related papers (2024-10-18T03:22:06Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary; a minimal sketch of this mapping appears after this list. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
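The visual-words entry above describes mapping visual features to probability distributions over the language model's vocabulary. A minimal sketch of such a mapping, using hypothetical dimensions and an untrained projection rather than that paper's implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: 576 visual patch features of width 1024, a 32k-token vocabulary.
num_patches, feat_dim, vocab_size = 576, 1024, 32000

visual_features = torch.randn(num_patches, feat_dim)            # stand-in for vision-encoder output
lm_head = torch.nn.Linear(feat_dim, vocab_size, bias=False)     # stand-in for the LMM output projection

# Each visual feature becomes a probability distribution over the vocabulary.
visual_word_dist = F.softmax(lm_head(visual_features), dim=-1)  # shape (576, 32000)
print(visual_word_dist.sum(dim=-1)[:3])                         # each row sums to 1
```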