Related papers: Preserving Localized Patch Semantics in VLMs

URL: http://arxiv.org/abs/2602.01530v1
Date: Mon, 02 Feb 2026 01:48:11 GMT
Title: Preserving Localized Patch Semantics in VLMs
Authors: Parsa Esmaeilkhani, Longin Jan Latecki
Abstract summary: We introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) constrains the mixing of image and text tokens in the self-attention layers in order to prevent image tokens from losing their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.
Score: 8.586228101739259
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Logit Lens has been proposed for visualizing tokens that contribute most to LLM answers. Recently, Logit Lens was also shown to be applicable in autoregressive Vision-Language Models (VLMs), where it illustrates the conceptual content of image tokens in the form of heatmaps, e.g., which image tokens are likely to depict the concept of cat in a given image. However, the visual content of image tokens often gets diffused to language tokens, and consequently, the locality of visual information gets mostly destroyed, which renders Logit Lens visualization unusable for explainability. To address this issue, we introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) is designed to make visual token embeddings more semantically aligned with the textual concepts that describe their image regions (e.g., patches containing a cat with the word "cat"), without requiring any architectural modification or large-scale training. This way, LLL constrains the mixing of image and text tokens in the self-attention layers in order to prevent image tokens from losing their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.
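The Logit Lens heatmap described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `logit_lens_heatmap`, the toy 2x2 patch grid, the three-word vocabulary, and the choice of index 0 as the "cat" token are all assumptions made for illustration, and the final normalization layer that a real model applies before unembedding is omitted. Each image-token hidden state is projected through the unembedding matrix, and the softmax probability of one concept token is laid out on the patch grid.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens_heatmap(hidden_states, unembedding, concept_id, grid_w):
    """Return a 2D grid of P(concept token | image-token hidden state).

    hidden_states: one vector per image token, in row-major patch order.
    unembedding:   one vector per vocabulary entry (hypothetical toy matrix).
    concept_id:    vocabulary index of the concept to visualize.
    grid_w:        number of patches per row.
    """
    probs = []
    for h in hidden_states:
        # Project the hidden state onto every vocabulary direction.
        logits = [sum(w_i * h_i for w_i, h_i in zip(w, h)) for w in unembedding]
        probs.append(softmax(logits)[concept_id])
    # Reshape the flat list of probabilities into rows of the patch grid.
    return [probs[r:r + grid_w] for r in range(0, len(probs), grid_w)]


# Toy setup (assumed): vocabulary of 3 tokens, index 0 plays the role of "cat".
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# Four image tokens on a 2x2 grid; only the first points toward "cat".
hidden_states = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0], [2.0, 2.0]]

heatmap = logit_lens_heatmap(hidden_states, unembedding, concept_id=0, grid_w=2)
for row in heatmap:
    print(["%.3f" % p for p in row])
```

In this toy grid the top-left patch receives the highest "cat" probability because its hidden state aligns with the "cat" unembedding direction. The diffusion problem the abstract describes corresponds to such maps losing their spatial contrast.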

Related papers

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs [40.11215282864732]
We introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. We evaluate this method on 10 different Vision-Language Models (VLMs). We show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans.
arXiv Detail & Related papers (2026-01-31T02:33:07Z)
Direct Visual Grounding by Directing Attention of Visual Tokens [8.586228101739259]
Vision Language Models (VLMs) mix visual tokens and text tokens. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We propose a novel loss function that directly supervises the attention of visual tokens.
arXiv Detail & Related papers (2025-11-16T19:09:21Z)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing [39.969451863788464]
Large Vision-Language Models (LVLMs) unify multimodal representations by encoding visual inputs into a finite set of tokens. We find that these models still hallucinate non-existent objects. We propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation.
arXiv Detail & Related papers (2025-05-24T22:36:15Z)
Descriminative-Generative Custom Tokens for Vision-Language Models [101.40245125955306]
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries.
arXiv Detail & Related papers (2025-02-17T18:13:42Z)
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weight to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [40.08368469646114]
We provide semantically meaningful visual tokens to transformer encoders within a vision-language pre-training framework. We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z)
Auto-Encoding Morph-Tokens for Multimodal LLM [151.2618346912529]
We propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing MLLM to generate texts. Experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously.
arXiv Detail & Related papers (2024-05-03T08:43:06Z)
Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [89.17394772676819]
We introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
arXiv Detail & Related papers (2021-06-21T17:55:59Z)
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.