Direct Visual Grounding by Directing Attention of Visual Tokens
- URL: http://arxiv.org/abs/2511.12738v1
- Date: Sun, 16 Nov 2025 19:09:21 GMT
- Title: Direct Visual Grounding by Directing Attention of Visual Tokens
- Authors: Parsa Esmaeilkhani, Longin Jan Latecki
- Abstract summary: Vision Language Models (VLMs) mix visual tokens and text tokens. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We propose a novel loss function that directly supervises the attention of visual tokens.
- Score: 8.586228101739259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is that the visual tokens most related to the query receive little to no attention from the answer tokens in the final layers of the VLM's LLM module, even though visual and language tokens are treated equally in the LLM attention layers. This fact may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that a more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual tokens. It directly grounds the answer language tokens in images by directing their attention to the relevant visual tokens. This is achieved by aligning the attention distribution of visual tokens to ground-truth attention maps with KL divergence. The ground-truth attention maps are obtained from task geometry in synthetic cases or from standard grounding annotations (e.g., bounding boxes or point annotations) in real images, and are used inside the LLM for attention supervision without requiring new labels. The resulting KL attention loss (KLAL), when combined with NTP, encourages VLMs to attend to relevant visual tokens while generating answer tokens. This yields notable improvements across geometric tasks, pointing, and referring expression comprehension on both synthetic and real-world data, as demonstrated by our experiments. We also introduce a new dataset to evaluate the line-tracing abilities of VLMs. Surprisingly, even commercial VLMs do not perform well on this task.
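The abstract's core idea, supervising the attention that answer tokens place on visual tokens with a KL divergence to a ground-truth map derived from task geometry or grounding annotations, can be illustrated in a few lines. The sketch below is a minimal PyTorch illustration, not the authors' implementation: the tensor shapes, the bounding-box rasterization, the choice of which attention weights to supervise, and the weighting factor `lam` are all assumptions.

```python
import torch
import torch.nn.functional as F

def bbox_to_patch_map(bbox, grid_h, grid_w):
    """Rasterize a normalized box (x0, y0, x1, y1 in [0, 1]) onto a grid_h x grid_w
    patch grid, giving an (unnormalized) ground-truth attention map over visual tokens."""
    x0, y0, x1, y1 = bbox
    gt = torch.zeros(grid_h, grid_w)
    r0, r1 = int(y0 * grid_h), max(int(y1 * grid_h), int(y0 * grid_h) + 1)
    c0, c1 = int(x0 * grid_w), max(int(x1 * grid_w), int(x0 * grid_w) + 1)
    gt[r0:r1, c0:c1] = 1.0
    return gt.flatten()  # shape: (grid_h * grid_w,)

def kl_attention_loss(attn, gt_map, eps=1e-8):
    """KL(gt || attn) between the ground-truth map and the attention that answer
    tokens place on visual tokens. attn: (batch, n_answer, n_visual)."""
    p = gt_map / (gt_map.sum(-1, keepdim=True) + eps)   # target distribution
    q = attn / (attn.sum(-1, keepdim=True) + eps)       # renormalized prediction
    return (p * ((p + eps).log() - (q + eps).log())).sum(-1).mean()

def total_loss(logits, targets, attn, gt_map, lam=1.0):
    """Next-token prediction cross-entropy plus the KL attention term (KLAL)."""
    ntp = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return ntp + lam * kl_attention_loss(attn, gt_map)
```

Because the ground-truth maps come from existing grounding annotations (or from task geometry in synthetic data), this kind of supervision adds no new labels; it only requires access to the model's attention weights during training, consistent with the abstract.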
Related papers
- Preserving Localized Patch Semantics in VLMs [8.586228101739259]
We introduce a loss, added to next-token prediction (NTP), that prevents visual tokens from losing the visual representation inherited from their corresponding image patches. LLL constrains the mixing of image and text tokens in the self-attention layers so that image tokens do not lose their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.
arXiv Detail & Related papers (2026-02-02T01:48:11Z) - Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention. We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z) - PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models [12.189644988996022]
We present an extremely simple yet effective approach to alleviate the recency bias in visual token pruning. We propose a straightforward reweighting mechanism that adjusts the attention scores of visual tokens according to their spatial positions in the image. Our method, termed Position-reweighted Visual Token Pruning, is a plug-and-play solution that can be seamlessly incorporated into existing visual token pruning frameworks.
arXiv Detail & Related papers (2025-08-25T08:56:32Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - Window Token Concatenation for Efficient Visual Large Language Models [59.6094005814282]
We propose Window Token Concatenation (WiCo) to reduce visual tokens in Visual Large Language Models (VLLMs). WiCo groups diverse tokens into one and can thus obscure some fine details. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors.
arXiv Detail & Related papers (2025-04-05T02:32:58Z) - Introducing Visual Perception Token into Multimodal Large Language Model [53.82301522384719]
A Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. However, MLLMs still lack the autonomous capability to control their own visual perception processes. We propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes.
arXiv Detail & Related papers (2025-02-24T18:56:12Z) - PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weight to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z) - [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. We introduce a simple yet effective method for training-free visual compression, called VTC-compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z) - Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs [34.3615740255575]
Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts. We propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Our results show that VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable performance.
arXiv Detail & Related papers (2024-12-02T18:57:40Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models [16.185253476874006]
Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations. We propose Attentional Vision Calibration (AvisC), a test-time approach that recalibrates the influence of blind tokens without modifying the underlying attention mechanism. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
arXiv Detail & Related papers (2024-05-28T04:40:57Z) - Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. VTW strategically withdraws vision tokens at a certain layer, enabling only text tokens to engage in subsequent layers (see the sketch after this list). Our approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z) - Auto-Encoding Morph-Tokens for Multimodal LLM [151.2618346912529]
We propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate texts.
Experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously.
arXiv Detail & Related papers (2024-05-03T08:43:06Z)
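Several entries above reduce or remove visual tokens inside the LLM to speed up inference. As referenced in the Visual Tokens Withdrawal (VTW) entry, the following is a minimal sketch of that idea under illustrative assumptions (a contiguous visual-token prefix, generic layer callables, and a hand-picked withdrawal layer); it is not the paper's implementation.

```python
import torch

def forward_with_withdrawal(hidden, layers, n_visual, withdraw_at):
    """Run decoder layers; from layer `withdraw_at` onward, keep only text tokens."""
    for i, layer in enumerate(layers):
        if i == withdraw_at:
            hidden = hidden[:, n_visual:, :]  # withdraw all visual tokens at once
        hidden = layer(hidden)
    return hidden

# Toy usage: 576 visual tokens + 40 text tokens, withdrawal after 16 of 32 layers.
layers = [torch.nn.Linear(64, 64) for _ in range(32)]
hidden = torch.randn(1, 576 + 40, 64)
out = forward_with_withdrawal(hidden, layers, n_visual=576, withdraw_at=16)
print(out.shape)  # torch.Size([1, 40, 64])
```

Note the design contrast with the per-token pruning methods listed above (e.g., HoloV, PoRe, VisPruner), which rank visual tokens and keep a subset, whereas withdrawal drops all of them past a chosen layer.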