Related papers: Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

URL: http://arxiv.org/abs/2503.06287v1
Date: Sat, 08 Mar 2025 17:24:42 GMT
Title: Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang,
Abstract summary: Visual grounding seeks to localize the image region corresponding to a free-form text description.<n>We introduce a training-free visual grounding framework that utilizes text-to-image attention maps to identify the target objects.<n>Our findings suggest that LVLMs can innately ground objects based on a deep comprehension of the text-image relationship.
Score: 4.024850952459758
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual grounding, though they inevitably require fine-tuning and additional model components to explicitly generate bounding boxes or segmentation masks. However, we discover that a few attention heads in frozen LVLMs demonstrate strong visual grounding capabilities. We refer to these heads, which consistently capture object locations related to text semantics, as localization heads. Using localization heads, we introduce a straightforward and effective training-free visual grounding framework that utilizes text-to-image attention maps from localization heads to identify the target objects. Surprisingly, only three out of thousands of attention heads are sufficient to achieve competitive localization performance compared to existing LVLM-based visual grounding methods that require fine-tuning. Our findings suggest that LVLMs can innately ground objects based on a deep comprehension of the text-image relationship, as they implicitly focus on relevant image regions to generate informative text outputs. All the source codes will be made available to the public.

Related papers

Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks. Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors.<n>Our goal is to resolve this with a vision backbone that effectively captures both local and global image semantics.<n>We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.<n>We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules.<n>CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose bfAnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references. Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements. We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z)
What does CLIP know about a red circle? Visual prompt engineering for VLMs [116.8806079598019]
We explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks.
arXiv Detail & Related papers (2023-04-13T17:58:08Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LO, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding. We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions. A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z)
AttnGrounder: Talking to Cars with Attention [6.09170287691728]
We propose a single-stage end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over the existing methods.
arXiv Detail & Related papers (2020-09-11T23:18:55Z)
Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs. ICMLM consists in predicting masked words in captions by relying on visual cues. Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.