ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering
- URL: http://arxiv.org/abs/2603.00165v1
- Date: Thu, 26 Feb 2026 06:28:43 GMT
- Title: ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering
- Authors: Zhaodong Wu, Haochen Xue, Qi Cao, Wenqi Mo, Yu Pei, Wenqi Xu, Jionglong Su, Yang Liu,
- Abstract summary: ConFoThinking learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance.
- Score: 10.689628202869635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on grounding capacity, which remains unreliable for MLLMs. In parallel, attention-driven methods that crop Regions of Interest (ROIs) have been proposed, but they are constrained by (1) fragmented attention signals scattered across layers, leading to suboptimal localization, and (2) reliance on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these findings, we propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Moreover, we extract attention using concise semantic cues of what to look into, which mitigates the semantic noise introduced by question- or redundant-text-based attention extraction. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance. The code, checkpoints, and dataset will be released upon acceptance.
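The abstract describes an attention-driven crop-and-zoom pipeline: consolidate attention at one intermediate layer, locate salient image patches, then crop and magnify them for a second look. The sketch below illustrates that general recipe only; it is not the authors' released code, and the layer choice, tensor shapes, keep_ratio, and zoom factor are assumptions.
```python
# Minimal sketch of attention-driven ROI crop-and-zoom (not the official
# ConFoThinking implementation). Assumes per-head attention weights from one
# designated intermediate layer, restricted to text-query -> image-patch
# positions, plus the patch grid size and the original PIL image.
import torch
from PIL import Image

def crop_salient_region(image: Image.Image,
                        attn: torch.Tensor,   # (heads, num_query_tokens, H*W)
                        grid_hw: tuple[int, int],
                        keep_ratio: float = 0.25,
                        zoom: int = 2) -> Image.Image:
    H, W = grid_hw
    # Consolidate: average over heads and query tokens -> one value per patch.
    patch_map = attn.mean(dim=(0, 1)).reshape(H, W)
    # Keep the top-k most attended patches and take their bounding box.
    k = max(1, int(keep_ratio * H * W))
    thresh = patch_map.flatten().topk(k).values.min()
    ys, xs = torch.nonzero(patch_map >= thresh, as_tuple=True)
    # Map patch indices back to pixel coordinates.
    px, py = image.width / W, image.height / H
    box = (int(xs.min() * px), int(ys.min() * py),
           int((xs.max() + 1) * px), int((ys.max() + 1) * py))
    crop = image.crop(box)
    # "Zoom in": upsample the crop so fine details receive more visual tokens.
    return crop.resize((crop.width * zoom, crop.height * zoom))
```
In a full pipeline the magnified crop would be re-encoded by the MLLM together with the original question for the final answer.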
Related papers
- Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement [30.12584783649903]
Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
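The snippet does not spell out LASER's layer-selection rule; one plausible way to make the "magic layer" input-adaptive is to score each layer's text-to-image attention and pick the most concentrated one. The sketch below is a hypothetical illustration of that idea, with entropy as an assumed scoring function.
```python
import torch

def pick_layer_by_entropy(attn_per_layer: torch.Tensor) -> int:
    """attn_per_layer: (num_layers, num_image_patches) text->image attention,
    already averaged over heads and query tokens, one row per decoder layer.
    Returns the layer whose attention is most concentrated (lowest entropy) --
    one illustrative way to choose the localization layer per input."""
    p = attn_per_layer / attn_per_layer.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)   # (num_layers,)
    return int(entropy.argmin())
```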
arXiv Detail & Related papers (2026-02-04T08:13:01Z) - Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
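As an illustration of attention contrasting, the hedged sketch below divides a task-conditioned attention map by a generic-prompt map so that query-agnostic saliency cancels out; this formulation is an assumption and may differ from CARVE's exact one.
```python
import torch

def contrastive_attention(task_attn: torch.Tensor,
                          generic_attn: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """task_attn / generic_attn: (H, W) attention maps over image positions,
    obtained with the task instruction and with a generic prompt respectively.
    Contrasting the two suppresses saliency driven purely by visual complexity
    and keeps the task-relevant signal. Illustrative formulation only."""
    contrast = task_attn / (generic_attn + eps)
    contrast = contrast - contrast.min()
    return contrast / (contrast.max() + eps)   # normalized mask in [0, 1]
```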
arXiv Detail & Related papers (2025-09-08T09:20:04Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features, respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly. CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation. This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
arXiv Detail & Related papers (2025-07-08T03:27:46Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [69.56484419619919]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
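One simple way to realize confidence-conditioned sharpening is temperature scaling of the image-token attention logits; the sketch below is an illustrative reading of the idea, with the threshold and temperature values chosen as assumptions.
```python
import torch

def adapt_attention(attn_logits: torch.Tensor,
                    confidence: float,
                    conf_threshold: float = 0.5,
                    sharpen_t: float = 0.5,
                    smooth_t: float = 2.0) -> torch.Tensor:
    """attn_logits: (..., num_image_tokens) pre-softmax attention scores.
    When the model is confident, a temperature < 1 sharpens attention on the
    regions it already favours; otherwise a temperature > 1 broadens it.
    Hypothetical parameters, shown only to make the mechanism concrete."""
    t = sharpen_t if confidence >= conf_threshold else smooth_t
    return torch.softmax(attn_logits / t, dim=-1)
```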
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
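A minimal sketch of the residual remolding idea follows, assuming pre-extracted per-layer attention maps; the plain layer average and the mixing weight alpha are illustrative choices, not the paper's exact RCS formulation.
```python
import torch

def remold_final_attention(inter_attn: torch.Tensor,
                           final_attn: torch.Tensor,
                           alpha: float = 0.5) -> torch.Tensor:
    """inter_attn: (num_layers, tokens, tokens) attention maps from CLIP's
    intermediate blocks; final_attn: (tokens, tokens) attention of the last
    block. Residually blends the better-localized intermediate correlations
    into the final block; alpha is an assumed mixing weight."""
    pooled = inter_attn.mean(dim=0)                         # aggregate layers
    remolded = (1 - alpha) * final_attn + alpha * pooled    # residual mix
    return remolded / remolded.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```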
arXiv Detail & Related papers (2024-11-24T14:14:14Z) - Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! [3.355491272942994]
This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics. We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing.
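One concrete way to quantify and penalize such overlap is the shared mass between two normalized cross-attention maps; the sketch below is an assumed formulation for illustration, and the paper itself studies several overlap-reduction objectives.
```python
import torch

def attention_overlap_loss(attn_a: torch.Tensor, attn_b: torch.Tensor) -> torch.Tensor:
    """attn_a, attn_b: (H, W) cross-attention maps for two entity tokens.
    The shared mass sum(min(a, b)) over normalized maps is one simple overlap
    measure; using it as a guidance loss pushes the entities toward disjoint
    spatial regions. Illustrative choice only."""
    a = attn_a / attn_a.sum().clamp_min(1e-8)
    b = attn_b / attn_b.sum().clamp_min(1e-8)
    return torch.minimum(a, b).sum()   # 0 = disjoint, 1 = identical maps
```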
arXiv Detail & Related papers (2024-10-28T12:43:48Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z) - Self-supervised Implicit Glyph Attention for Text Recognition [52.68772018871633]
We propose a novel attention mechanism for scene text recognition (STR) methods, self-supervised implicit glyph attention (SIGA).
SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment.
Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods.
arXiv Detail & Related papers (2022-03-07T13:40:33Z) - Semantic Reinforced Attention Learning for Visual Place Recognition [15.84086970453363]
Large-scale visual place recognition (VPR) is inherently challenging because not all visual cues in the image are beneficial to the task.
We propose a novel Semantic Reinforced Attention Learning Network (SRALNet), in which the inferred attention can benefit from both semantic priors and data-driven fine-tuning.
Experiments demonstrate that our method outperforms state-of-the-art techniques on city-scale VPR benchmark datasets.
arXiv Detail & Related papers (2021-08-19T02:14:36Z)