Related papers: Dissecting Query-Key Interaction in Vision Transformers

URL: http://arxiv.org/abs/2405.14880v2
Date: Mon, 27 May 2024 01:31:56 GMT
Title: Dissecting Query-Key Interaction in Vision Transformers
Authors: Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz
Abstract summary: Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings. We propose to use the Singular Value Decomposition to dissect the query-key interaction. We find that early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens.
Score: 4.743574336827573
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to use the Singular Value Decomposition to dissect the query-key interaction (i.e. ${\textbf{W}_q}^\top\textbf{W}_k$). We find that early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant objects, between parts of an object, or between the foreground and background. This offers a novel perspective on interpreting the attention mechanism, which contributes to understanding how transformer models utilize context and salient features when processing images.
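The decomposition named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes the convention q = W_q x and k = W_k x, uses random matrices as stand-ins for trained weights, and leaves out biases, the 1/sqrt(d) scaling, and the softmax. The function names (`qk_singular_modes`, `attention_logit`, `mode_alignment`) are hypothetical, and the cosine between paired singular vectors is only one possible way to quantify whether a mode links similar or dissimilar token directions.

```python
import numpy as np


def qk_singular_modes(W_q, W_k):
    """SVD of the query-key interaction matrix W_q^T W_k.

    W_q, W_k: arrays of shape (d_head, d_model), with q = W_q @ x and k = W_k @ x.
    Returns the leading singular modes (the matrix has rank at most d_head).
    """
    M = W_q.T @ W_k                      # shape (d_model, d_model)
    U, S, Vt = np.linalg.svd(M)
    r = min(W_q.shape)                   # upper bound on the rank of M
    return U[:, :r], S[:r], Vt[:r, :]


def attention_logit(x_q, x_k, U, S, Vt):
    """Pre-softmax score written as a sum over singular modes:
    sum_n S[n] * (x_q . u_n) * (v_n . x_k)."""
    return np.sum(S * (x_q @ U) * (Vt @ x_k))


def mode_alignment(U, Vt):
    """Cosine similarity between the left and right singular vector of each mode.
    Both vectors have unit norm, so the dot product is already the cosine."""
    return np.sum(U * Vt.T, axis=0)


# Random stand-ins for one attention head (assumption: not trained weights).
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, d_model))

U, S, Vt = qk_singular_modes(W_q, W_k)

x_q = rng.standard_normal(d_model)       # embedding of the query token
x_k = rng.standard_normal(d_model)       # embedding of the key token

direct = (W_q @ x_q) @ (W_k @ x_k)       # usual q . k score
via_modes = attention_logit(x_q, x_k, U, S, Vt)
alignment = mode_alignment(U, Vt)
```

In this sketch `direct` and `via_modes` agree up to floating-point error, which is what lets each singular mode be read as a separate channel from a query-side direction to a key-side direction. An alignment near +1 would suggest a mode that favors tokens similar to the query, while values near zero or below would suggest attention to dissimilar tokens.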

Related papers

KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z)
Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning [46.25534556546322]
We propose to mine open semantics as anchors to perform a relation transition from image-anchor relationship to image-target relationship to make predictions. Our method performs favorably against previous state-of-the-art methods in few-shot classification settings.
arXiv Detail & Related papers (2024-06-17T06:28:58Z)
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z)
Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding. Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction. Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
Learning-based Relational Object Matching Across Views [63.63338392484501]
We propose a learning-based approach which combines local keypoints with novel object-level features for matching object detections between RGB images. We train our object-level matching features based on appearance and inter-frame and cross-frame spatial relations between objects in an associative graph neural network.
arXiv Detail & Related papers (2023-05-03T19:36:51Z)
AttEntropy: Segmenting Unknown Objects in Complex Scenes using the Spatial Attention Entropy of Semantic Segmentation Transformers [99.22536338338011]
We study the spatial attentions of different backbone layers of semantic segmentation transformers. We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds. Our method is training-free and its computational overhead is negligible.
arXiv Detail & Related papers (2022-12-29T18:07:56Z)
Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them. Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint. Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z)
Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts. We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query. Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
KVT: k-NN Attention for Boosting Vision Transformers [44.189475770152185]
We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations. We verify, both theoretically and empirically, that $k$-NN attention is powerful in distilling noise from input tokens and in speeding up training.
arXiv Detail & Related papers (2021-05-28T06:49:10Z)
Context-Aware Interaction Network for Question Matching [51.76812857301819]
We propose a context-aware interaction network (COIN) to align two sequences and infer their semantic relationship. Specifically, each interaction block includes (1) a context-aware cross-attention mechanism to effectively integrate contextual information, and (2) a gate fusion layer to flexibly interpolate aligned representations.
arXiv Detail & Related papers (2021-04-17T05:03:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.