Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
- URL: http://arxiv.org/abs/2603.04676v1
- Date: Wed, 04 Mar 2026 23:34:39 GMT
- Title: Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
- Authors: Chenjun Li,
- Abstract summary: We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses". We propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).
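The abstract describes the mechanism only at a high level, so the sketch below is a minimal illustration of the two ingredients it names: quantifying how diffuse the T2I attention of a decoded token is, and softly gating decode-time attention toward the image referenced in the preceding plan block. The function names, the entropy diagnostic, and the additive-logit `gate_strength` parameterization are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def t2i_attention_entropy(attn_weights: torch.Tensor,
                          image_token_mask: torch.Tensor) -> torch.Tensor:
    """Entropy of one query token's attention restricted to image tokens.

    High entropy corresponds to the 'diffuse pulse' pattern described in the
    abstract: attention spread thinly over image tokens instead of concentrating.
    attn_weights:     (num_keys,) post-softmax attention for one decoded token.
    image_token_mask: (num_keys,) bool, True where the key is an image token.
    """
    p = attn_weights[image_token_mask]
    p = p / p.sum().clamp_min(1e-9)            # renormalize over image tokens only
    return -(p * p.clamp_min(1e-9).log()).sum()

def soft_gate_t2i_attention(attn_logits: torch.Tensor,
                            image_token_mask: torch.Tensor,
                            focus_mask: torch.Tensor,
                            gate_strength: float = 2.0) -> torch.Tensor:
    """Softly bias decode-time attention toward the image named in the last plan block.

    attn_logits:   (..., num_keys) pre-softmax scores for the current query token.
    focus_mask:    (num_keys,) bool, True for tokens of the currently focused image.
    gate_strength: additive logit bonus/penalty (a hypothetical parameterization).
    """
    other_images = image_token_mask & ~focus_mask
    bias = gate_strength * (focus_mask.float() - other_images.float())
    # Text tokens receive zero bias, so the linguistic context is left untouched.
    return torch.softmax(attn_logits + bias, dim=-1)
```

In this sketch the gating would be applied inside each attention head while the model is generating a focus block; how PulseFocus identifies image-token spans and parses the plan block is not specified in the abstract.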
Related papers
- ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering [10.689628202869635]
ConFoThinking learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Experiments across five VQA benchmarks demonstrate ConFoThinking significantly improves perception performance.
arXiv Detail & Related papers (2026-02-26T06:28:43Z) - Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z) - More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z) - CoFFT: Chain of Foresight-Focus Thought for Visual Language Models [61.34272727005052]
Chain of Foresight-Focus Thought (CoFFT) is a training-free approach that enhances visual reasoning by emulating human visual cognition. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.
arXiv Detail & Related papers (2025-09-26T07:46:30Z) - Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction [6.416957959150438]
Hallucinations hinder the application of Large Vision-Language Models (LVLMs) in domains that require high reliability. We propose MINT (MItigating hallucinations via tokeN reducTion), a training-free decoding strategy. Our approach achieves a 4% improvement in mitigating hallucinations caused by distracted perception compared to original models.
arXiv Detail & Related papers (2025-02-02T08:34:57Z) - Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! [3.355491272942994]
This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics. We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing.
arXiv Detail & Related papers (2024-10-28T12:43:48Z) - More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints [63.08768589044052]
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z)