Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage
- URL: http://arxiv.org/abs/2601.22483v1
- Date: Fri, 30 Jan 2026 02:46:55 GMT
- Title: Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage
- Authors: Junfei Xie, Peng Pan, Xulong Zhang
- Abstract summary: We propose Head-Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. Experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies.
- Score: 4.771792258699647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose Head-Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy (for stronger spatial concentration) and gradient sensitivity (for predictive contribution). The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage that is then provided to the MLLM together with the original image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
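The inference pipeline in the abstract maps onto a short sketch: keep the most spatially concentrated heads by entropy, weight the survivors by gradient sensitivity, fuse their maps into a guidance map, and crop around its peak. Everything below is a minimal illustration under assumptions: the OCR-based head filtering and the gradient scores are taken as precomputed inputs, and the keep-fraction, fusion rule, and crop size are illustrative choices rather than the paper's exact formulation.

```python
import numpy as np

def spatial_entropy(att: np.ndarray) -> float:
    """Shannon entropy of a 2-D attention map (low = spatially concentrated)."""
    p = att.flatten() / (att.sum() + 1e-8)
    return float(-(p * np.log(p + 1e-8)).sum())

def havc_guidance_map(head_maps: np.ndarray, grad_scores: np.ndarray,
                      entropy_keep: float = 0.5) -> np.ndarray:
    """Fuse grounding-head attention into a cropping guidance map.

    head_maps:    (H, h, w) attention maps from OCR-verified grounding heads
    grad_scores:  (H,) gradient-sensitivity weights (assumed precomputed upstream)
    entropy_keep: fraction of heads retained after spatial-entropy filtering
    """
    entropies = np.array([spatial_entropy(m) for m in head_maps])
    k = max(1, int(len(head_maps) * entropy_keep))
    keep = np.argsort(entropies)[:k]          # lowest entropy = most concentrated
    w = grad_scores[keep] / (grad_scores[keep].sum() + 1e-8)
    return np.einsum("h,hij->ij", w, head_maps[keep])

def crop_from_map(image: np.ndarray, gmap: np.ndarray,
                  crop_frac: float = 0.4) -> np.ndarray:
    """Crop a fixed-size window around the guidance map's peak."""
    H, W = image.shape[:2]
    gy, gx = np.unravel_index(gmap.argmax(), gmap.shape)
    cy, cx = int(gy * H / gmap.shape[0]), int(gx * W / gmap.shape[1])
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    y0 = int(np.clip(cy - ch // 2, 0, H - ch))
    x0 = int(np.clip(cx - cw // 2, 0, W - cw))
    return image[y0:y0 + ch, x0:x0 + cw]
```

In use, the returned subimage would be fed to the MLLM alongside the original image-question pair, as the abstract describes.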
Related papers
- Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation [51.743225614196774]
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning. They remain vulnerable to hallucination, where generated content deviates from visual evidence. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding. We propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs.
arXiv Detail & Related papers (2026-02-27T14:18:51Z)
- Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning [17.81009868725361]
A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning. It remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. We propose the High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions.
arXiv Detail & Related papers (2026-02-27T02:43:35Z)
- ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering [10.689628202869635]
ConFoThinking learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance.
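The "mine and zoom" step can be pictured with a small sketch: threshold the attention map aggregated at the designated layer and take the bounding box of the strongest response as the zoom target. The aggregation and the threshold below are illustrative assumptions; the paper's actual mining procedure may differ.

```python
import numpy as np

def mine_salient_box(att_map: np.ndarray, thresh_frac: float = 0.6):
    """Bounding box of the above-threshold region in an aggregated attention map.

    att_map:     (h, w) attention aggregated at the designated intermediate layer
    thresh_frac: keep cells above this fraction of the peak (illustrative choice)
    Returns (y0, x0, y1, x1) in normalized coordinates, ready to scale to the
    full-resolution image for zooming.
    """
    mask = att_map >= thresh_frac * att_map.max()
    ys, xs = np.nonzero(mask)
    h, w = att_map.shape
    return (ys.min() / h, xs.min() / w, (ys.max() + 1) / h, (xs.max() + 1) / w)
```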
arXiv Detail & Related papers (2026-02-26T06:28:43Z)
- Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding [78.26501371437013]
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition. We find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. We propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL.
arXiv Detail & Related papers (2026-02-15T16:40:33Z)
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z)
- Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement [30.12584783649903]
Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
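One minimal way to make the localization layer dynamic, sketched under assumptions: score each candidate layer's aggregated attention per input and localize from the winner. The concentration criterion below (lowest spatial entropy) is a stand-in of our own; LASER's actual selection rule is not specified in this summary.

```python
import numpy as np

def select_layer(layer_maps: np.ndarray) -> int:
    """Pick, per input, the layer whose attention is most spatially concentrated.

    layer_maps: (L, h, w) aggregated attention maps, one per candidate layer
    Returns the index of the lowest-entropy layer -- an assumed proxy for
    "this layer localizes the answer region best for this input".
    """
    def entropy(m: np.ndarray) -> float:
        p = m.flatten() / (m.sum() + 1e-8)
        return float(-(p * np.log(p + 1e-8)).sum())
    return int(np.argmin([entropy(m) for m in layer_maps]))
```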
arXiv Detail & Related papers (2026-02-04T08:13:01Z)
- Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance [60.028070589466445]
Pyramid Token Pruning (PTP) is a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. We show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
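The fusion of bottom-up saliency with top-down instruction relevance can be sketched as a weighted token score followed by a top-k keep. The mixing weight and score definitions below are illustrative assumptions, not PTP's exact formulation (which also operates hierarchically at the region level).

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, saliency: np.ndarray,
                 relevance: np.ndarray, keep_ratio: float = 0.3,
                 alpha: float = 0.5) -> np.ndarray:
    """Keep the top-k visual tokens by fused bottom-up / top-down importance.

    tokens:    (N, d) visual token embeddings
    saliency:  (N,) bottom-up visual saliency per token
    relevance: (N,) top-down similarity of each token to the instruction
    alpha:     assumed mixing weight between the two signals
    """
    def norm(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    score = alpha * norm(saliency) + (1 - alpha) * norm(relevance)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(score)[-k:]
    return tokens[np.sort(keep)]   # preserve original token order
```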
arXiv Detail & Related papers (2025-09-19T07:28:17Z)
- HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation [74.1872891313184]
HRSeg is an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE).
arXiv Detail & Related papers (2025-07-17T08:09:31Z)
- Q-Insight: Understanding Image Quality via Visual Reinforcement Learning [27.26829134776367]
Image quality assessment (IQA) focuses on the perceptual visual quality of images, playing a crucial role in downstream tasks such as image reconstruction, compression, and generation. We propose Q-Insight, a reinforcement learning-based model built upon group relative policy optimization (GRPO). We show that Q-Insight substantially outperforms existing state-of-the-art methods in both score regression and degradation perception tasks.
arXiv Detail & Related papers (2025-03-28T17:59:54Z) - Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
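One plausible instance of such a training-free strategy, sketched under assumptions: contrast the logits of the image-conditioned pass against a text-only pass, so tokens favored purely by the language prior are penalized. The paper proposes two strategies; this contrastive form is an illustration, not necessarily either of them.

```python
import numpy as np

def debiased_logits(logits_with_image: np.ndarray,
                    logits_text_only: np.ndarray,
                    lam: float = 1.0) -> np.ndarray:
    """Penalize next-token logits that the LLM assigns even without the image.

    logits_with_image: (V,) logits conditioned on image + question
    logits_text_only:  (V,) logits conditioned on the question alone
    lam:               assumed penalty strength
    """
    return (1 + lam) * logits_with_image - lam * logits_text_only
```

Decoding would then sample from the softmax of the adjusted logits at each step.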
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - TOPIQ: A Top-down Approach from Semantics to Distortions for Image
Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which computes attention maps for lower-level features under the guidance of high-level semantics.
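A minimal sketch of one such cross-scale attention map, assuming a single pooled high-level query attending over low-level spatial features; TOPIQ's actual mechanism is multi-scale and more elaborate.

```python
import numpy as np

def cross_scale_attention(high_feat: np.ndarray, low_feats: np.ndarray) -> np.ndarray:
    """Attention map over low-level features, guided by high-level semantics.

    high_feat: (d,)   pooled high-level semantic feature (the query)
    low_feats: (N, d) low-level spatial features (the keys), N = h * w locations
    Returns an (N,) map highlighting semantically important local regions,
    e.g., likely distortion areas worth weighting in the quality score.
    """
    scores = low_feats @ high_feat / np.sqrt(high_feat.shape[0])
    scores -= scores.max()              # numerical stability before exponentiation
    weights = np.exp(scores)
    return weights / weights.sum()      # softmax over spatial locations
```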
arXiv Detail & Related papers (2023-08-06T09:08:37Z)