Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement
- URL: http://arxiv.org/abs/2602.04304v1
- Date: Wed, 04 Feb 2026 08:13:01 GMT
- Title: Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement
- Authors: Zipeng Zhu, Zhanghao Hu, Qinglin Zhu, Yuxi Hong, Yijun Liu, Jingyong Su, Yulan He, Lin Gui
- Abstract summary: Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
- Score: 30.12584783649903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
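The abstract describes VAQ only at a conceptual level (the layer whose attention map is most sensitive to the input query), so the following is a minimal sketch of how such a per-layer score could be computed, assuming access to per-layer attention maps from two runs of the same LVLM (the real query versus a query-agnostic prompt) and image tokens occupying a fixed prefix of the source sequence. The function names, the neutral-prompt baseline, and the total-variation formulation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a VAQ-style layer selector. The paper describes the
# metric only conceptually, so the scoring formula and the neutral-prompt
# baseline below are assumptions for illustration, not the authors' code.
import torch

def vaq_layer_scores(attn_with_query, attn_neutral, num_image_tokens):
    """Score each layer by how strongly its image attention shifts with the query.

    attn_with_query / attn_neutral: per-layer attention maps of shape
        [heads, tgt_len, src_len], taken from the same model run once with the
        real query and once with a query-agnostic prompt (e.g. "Describe the
        image."). Image tokens are assumed to be the first num_image_tokens
        positions of the source sequence in both runs.
    """
    scores = []
    for a_q, a_n in zip(attn_with_query, attn_neutral):
        # Average over heads and target positions, keep only the image tokens,
        # and renormalize so each map is a distribution over image patches.
        p_q = a_q.mean(dim=(0, 1))[:num_image_tokens]
        p_n = a_n.mean(dim=(0, 1))[:num_image_tokens]
        p_q = p_q / p_q.sum()
        p_n = p_n / p_n.sum()
        # Query sensitivity measured as total-variation distance between maps.
        scores.append(0.5 * (p_q - p_n).abs().sum().item())
    return scores

def select_vaq_layer(attn_with_query, attn_neutral, num_image_tokens):
    """Pick the layer whose visual attention is most sensitive to the query."""
    scores = vaq_layer_scores(attn_with_query, attn_neutral, num_image_tokens)
    return max(range(len(scores)), key=scores.__getitem__)
```

LASER would then use the attention map of the selected layer to drive visual localization (e.g., cropping) and decoding enhancement; those downstream steps are not sketched here.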
Related papers
- Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage [4.771792258699647]
We propose Head-Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. Experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies.
arXiv Detail & Related papers (2026-01-30T02:46:55Z) - Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance (a minimal sketch of such an entropy measure appears after this list). We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z) - CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z) - Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs [9.406760867809124]
This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple and effective method that augments visual inputs with low-level spatial structures. We empirically demonstrate substantial performance improvements across core visual reasoning tasks.
arXiv Detail & Related papers (2025-06-27T11:44:40Z) - Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering [11.271123465926301]
Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering. We propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions. Experiments on four benchmarks (ScienceQA, TextQA, VizWiz, and MME) demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs.
arXiv Detail & Related papers (2025-06-01T03:15:29Z) - Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent [72.1517476116743]
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue. We introduce a novel perspective that leverages effective rank to quantify the degradation of visual representations, and propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations.
arXiv Detail & Related papers (2025-02-17T12:26:34Z) - Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities. We propose a novel framework based on multimodal retrieval-augmented generation (RAG) that introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z) - Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z) - Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z) - Weakly Supervised Grounding for VQA in Vision-Language Transformers [112.5344267669495]
This paper focuses on the problem of weakly supervised grounding in context of visual question answering in transformers.
The approach leverages capsules by grouping each visual token in the visual encoder.
We evaluate our approach on the challenging GQA as well as VQA-HAT dataset for VQA grounding.
arXiv Detail & Related papers (2022-07-05T22:06:03Z)
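The CARVE entry above reports that visual complexity correlates with attention entropy. As a point of reference, the following is a minimal sketch of how the entropy of a visual attention map could be measured; the normalization and flattening choices are illustrative assumptions, and the exact definition used by CARVE is not given in the abstract above.

```python
# Minimal sketch: Shannon entropy of a visual attention map (illustrative only;
# not the CARVE authors' implementation).
import torch

def attention_entropy(attn_map, eps=1e-12):
    """attn_map: [H, W] tensor of non-negative attention weights over image patches."""
    p = attn_map.flatten()
    p = p / (p.sum() + eps)                      # normalize to a probability distribution
    return -(p * (p + eps).log()).sum().item()   # higher value = more diffuse attention

# Diffuse (uniform) attention has higher entropy than sharply peaked attention.
uniform = torch.ones(16, 16)
peaked = torch.zeros(16, 16)
peaked[8, 8] = 1.0
assert attention_entropy(uniform) > attention_entropy(peaked)
```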