Register and CLS tokens yield a decoupling of local and global features in large ViTs
- URL: http://arxiv.org/abs/2505.05892v1
- Date: Fri, 09 May 2025 09:00:17 GMT
- Title: Register and CLS tokens yield a decoupling of local and global features in large ViTs
- Authors: Alexander Lappe, Martin A. Giese
- Abstract summary: We study the influence of register tokens on the relationship between global and local image features. We show that the CLS token itself, which can be interpreted as a register, leads to a very similar phenomenon in models without explicit register tokens.
- Score: 49.40323406667405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that the attention maps of the widely popular DINOv2 model exhibit artifacts, which hurt both model interpretability and performance on dense image tasks. These artifacts emerge due to the model repurposing patch tokens with redundant local information for the storage of global image information. To address this problem, additional register tokens have been incorporated in which the model can store such information instead. We carefully examine the influence of these register tokens on the relationship between global and local image features, showing that while register tokens yield cleaner attention maps, these maps do not accurately reflect the integration of local image information in large models. Instead, global information is dominated by information extracted from register tokens, leading to a disconnect between local and global features. Inspired by these findings, we show that the CLS token itself, which can be interpreted as a register, leads to a very similar phenomenon in models without explicit register tokens. Our work shows that care must be taken when interpreting attention maps of large ViTs. Further, by clearly attributing the faulty behaviour to register and CLS tokens, we show a path towards more interpretable vision models.
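The register-token mechanism the abstract describes can be sketched as follows. This is a minimal NumPy toy with hypothetical dimensions and random weights standing in for a trained model, not DINOv2's actual implementation: register tokens are concatenated with the CLS and patch tokens before self-attention and simply dropped from the output, so any global information they absorb bypasses the patch features.

```python
import numpy as np

# Hypothetical sizes; real ViTs use larger embeddings and many layers.
rng = np.random.default_rng(0)
d, n_patches, n_registers = 64, 196, 4

cls_tok = rng.normal(size=(1, d))            # CLS token (global feature)
registers = rng.normal(size=(n_registers, d))  # extra register tokens
patches = rng.normal(size=(n_patches, d))    # local patch tokens

# Register tokens sit alongside CLS in the token sequence.
x = np.concatenate([cls_tok, registers, patches], axis=0)  # (1+4+196, d)

# One single-head self-attention step; weights are random placeholders.
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)     # rows sum to 1
out = attn @ v

cls_out = out[0]                    # global image feature, read from CLS
patch_out = out[1 + n_registers:]   # local features; registers out[1:5] are discarded
```

Because the registers are discarded, the CLS output can draw its global information from them rather than from the patch tokens, which is the local/global decoupling the paper reports.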
Related papers
- Vision Transformers Don't Need Trained Registers [17.412430704896455]
A sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens. We create a training-free approach to mitigate these artifacts. Our results suggest that test-time registers effectively take on the role of register tokens at test-time.
arXiv Detail & Related papers (2025-06-09T17:59:57Z) - PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weights to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z) - Quantised Global Autoencoder: A Holistic Approach to Representing Visual Data [7.152103069753289]
In quantised autoencoders, images are usually split into local patches, each encoded by one token.
Our method is inspired by spectral decompositions which transform an input signal into a superposition of global frequencies.
arXiv Detail & Related papers (2024-07-16T17:05:20Z) - Register assisted aggregation for Visual Place Recognition [4.5476780843439535]
Visual Place Recognition (VPR) refers to the process of using computer vision to recognize the position of the current query image.
Previous methods often discarded useless features, but in doing so also uncontrollably discarded features that help improve recognition accuracy.
We propose a new feature aggregation method to address this issue. Specifically, in order to obtain global and local features that contain discriminative place information, we added some registers on top of the original image tokens.
arXiv Detail & Related papers (2024-05-19T11:36:52Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on this learning process and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation [67.26984058377435]
We present L2G, a simple online local-to-global knowledge transfer framework for high-quality object attention mining.
Our framework conducts the global network to learn the captured rich object detail knowledge from a global view.
Experiments show that our method attains 72.1% and 44.2% mIoU on the validation sets of PASCAL VOC 2012 and MS COCO 2014, respectively.
arXiv Detail & Related papers (2022-04-07T04:31:32Z) - Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z) - TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [112.46381729542658]
Weakly supervised object localization (WSOL) is a challenging problem when given image category labels.
We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in vision transformers for long-range dependency extraction.
arXiv Detail & Related papers (2021-03-27T09:43:16Z) - On the Importance of Local Information in Transformer Based Models [19.036044858449593]
The self-attention module is a key component of Transformer-based models.
Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour.
We show that a larger fraction of heads have a locality bias as compared to a syntactic bias.
arXiv Detail & Related papers (2020-08-13T11:32:47Z) - Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.