Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
- URL: http://arxiv.org/abs/2603.04676v1
- Date: Wed, 04 Mar 2026 23:34:39 GMT
- Title: Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
- Authors: Chenjun Li,
- Abstract summary: We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses". We propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).
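The abstract describes the mechanism only at a high level, so the sketch below is a minimal illustration of the two ingredients it names: quantifying how diffuse the T2I attention of a decoded token is, and softly gating decode-time attention toward the image referenced in the preceding plan block. The function names, the entropy diagnostic, and the additive-logit `gate_strength` parameterization are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def t2i_attention_entropy(attn_weights: torch.Tensor,
                          image_token_mask: torch.Tensor) -> torch.Tensor:
    """Entropy of one query token's attention restricted to image tokens.

    High entropy corresponds to the 'diffuse pulse' pattern described in the
    abstract: attention spread thinly over image tokens instead of concentrating.
    attn_weights:     (num_keys,) post-softmax attention for one decoded token.
    image_token_mask: (num_keys,) bool, True where the key is an image token.
    """
    p = attn_weights[image_token_mask]
    p = p / p.sum().clamp_min(1e-9)            # renormalize over image tokens only
    return -(p * p.clamp_min(1e-9).log()).sum()

def soft_gate_t2i_attention(attn_logits: torch.Tensor,
                            image_token_mask: torch.Tensor,
                            focus_mask: torch.Tensor,
                            gate_strength: float = 2.0) -> torch.Tensor:
    """Softly bias decode-time attention toward the image named in the last plan block.

    attn_logits:   (..., num_keys) pre-softmax scores for the current query token.
    focus_mask:    (num_keys,) bool, True for tokens of the currently focused image.
    gate_strength: additive logit bonus/penalty (a hypothetical parameterization).
    """
    other_images = image_token_mask & ~focus_mask
    bias = gate_strength * (focus_mask.float() - other_images.float())
    # Text tokens receive zero bias, so the linguistic context is left untouched.
    return torch.softmax(attn_logits + bias, dim=-1)
```

In this sketch the gating would be applied inside each attention head while the model is generating a focus block; how PulseFocus identifies image-token spans and parses the plan block is not specified in the abstract.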
Related papers
- ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering [10.689628202869635]
ConFoThinking learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Experiments across five VQA benchmarks demonstrate ConFoThinking significantly improves perception performance.
arXiv Detail & Related papers (2026-02-26T06:28:43Z) - Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z) - More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z) - CoFFT: Chain of Foresight-Focus Thought for Visual Language Models [61.34272727005052]
Chain of Foresight-Focus Thought (CoFFT) is a training-free approach that enhances visual reasoning by emulating human visual cognition. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.
arXiv Detail & Related papers (2025-09-26T07:46:30Z) - Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction [6.416957959150438]
Hallucinations hinder the application of Large Vision-Language Models (LVLMs) in domains that require high reliability. We propose MINT (MItigating hallucinations via tokeN reducTion), a training-free decoding strategy. Our approach achieves a 4% improvement in mitigating hallucinations caused by distracted perception compared to original models.
arXiv Detail & Related papers (2025-02-02T08:34:57Z) - Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! [3.355491272942994]
This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics. We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing.
arXiv Detail & Related papers (2024-10-28T12:43:48Z) - More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints [63.08768589044052]
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z)