A Manifold Representation of the Key in Vision Transformers
- URL: http://arxiv.org/abs/2402.00534v2
- Date: Fri, 7 Jun 2024 10:41:05 GMT
- Title: A Manifold Representation of the Key in Vision Transformers
- Authors: Li Meng, Morten Goodwin, Anis Yazidi, Paal Engelstad
- Abstract summary: This paper explores the concept of disentangling the key from the query and value, and adopting a manifold representation for the key.
Our experiments reveal that decoupling and endowing the key with a manifold structure can enhance the model's performance.
- Score: 8.938418994111716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers implement multi-head self-attention via stacking multiple attention blocks. The query, key, and value are often intertwined and generated within those blocks via a single, shared linear transformation. This paper explores the concept of disentangling the key from the query and value, and adopting a manifold representation for the key. Our experiments reveal that decoupling and endowing the key with a manifold structure can enhance the model's performance. Specifically, ViT-B exhibits a 0.87% increase in top-1 accuracy, while Swin-T sees a boost of 0.52% in top-1 accuracy on the ImageNet-1K dataset, with eight charts in the manifold key. Our approach also yields positive results in object detection and instance segmentation tasks on the COCO dataset. We establish that these performance gains are not merely due to the simplicity of adding more parameters and computations. Future research may investigate strategies for cutting the budget of such representations and aim for further performance improvements based on our findings.
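The abstract describes two ideas: giving the key its own projection (decoupled from the shared query/value transformation) and expressing that key through several "charts" (eight in the reported setting). Below is a minimal PyTorch sketch of one plausible reading of that design; the module name, the per-token softmax gate, and the way chart outputs are blended are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ManifoldKeyAttention(nn.Module):
    """Single-head self-attention where the key is decoupled from the
    query/value projection and built from several chart-specific maps.

    Sketch only: each chart applies its own linear map plus normalisation,
    and a learned softmax gate blends the chart outputs into the final key.
    """

    def __init__(self, dim: int, num_charts: int = 8):
        super().__init__()
        self.scale = dim ** -0.5
        # Query and value still come from one shared projection, as in a standard block.
        self.to_qv = nn.Linear(dim, dim * 2)
        # The key gets its own, disentangled set of chart projections.
        self.charts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_charts)])
        self.chart_norm = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, num_charts)  # per-token chart weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim)
        q, v = self.to_qv(x).chunk(2, dim=-1)
        # Stack chart outputs: (B, N, num_charts, dim)
        chart_keys = torch.stack([self.chart_norm(c(x)) for c in self.charts], dim=2)
        # Blend the charts with a per-token softmax gate: (B, N, dim)
        w = self.gate(x).softmax(dim=-1).unsqueeze(-1)
        k = (chart_keys * w).sum(dim=2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v

# Usage: out = ManifoldKeyAttention(768)(torch.randn(2, 197, 768))
```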
Related papers
- RAT: Retrieval-Augmented Transformer for Click-Through Rate Prediction [68.34355552090103]
This paper develops a Retrieval-Augmented Transformer (RAT), aiming to acquire fine-grained feature interactions within and across samples.
We then build Transformer layers with cascaded attention to capture both intra- and cross-sample feature interactions.
Experiments on real-world datasets substantiate the effectiveness of RAT and suggest its advantage in long-tail scenarios.
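A minimal sketch of the cascaded intra-/cross-sample attention idea summarised above; the block layout, shapes, and residual/norm placement are assumptions for illustration, not RAT's released code.

```python
import torch
import torch.nn as nn

class CascadedAttentionBlock(nn.Module):
    """Hypothetical cascaded attention: the target sample's feature fields
    first attend to each other (intra-sample), then the result attends over
    the feature fields of retrieved neighbour samples (cross-sample)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # target:    (B, F, dim)      -- F feature fields of the query sample
        # retrieved: (B, K * F, dim)  -- fields of K retrieved samples, flattened
        h, _ = self.intra(target, target, target)        # intra-sample interactions
        target = self.norm1(target + h)
        h, _ = self.cross(target, retrieved, retrieved)  # cross-sample interactions
        return self.norm2(target + h)

# Usage: out = CascadedAttentionBlock(64)(torch.randn(8, 10, 64), torch.randn(8, 50, 64))
```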
arXiv Detail & Related papers (2024-04-02T19:14:23Z)
- Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression [63.23578860867408]
We investigate how to integrate the evaluations of importance and sparsity scores into a single stage.
We present OFB, a cost-efficient approach that simultaneously evaluates both importance and sparsity scores.
Experiments demonstrate that OFB can achieve superior compression performance over state-of-the-art searching-based and pruning-based methods.
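A toy sketch of the single-stage idea of scoring importance and sparsity together; the L1-magnitude importance, the sigmoid sparsity gate, and their product are illustrative assumptions, not the OFB algorithm.

```python
import torch

def combined_prune_mask(weight: torch.Tensor,
                        sparsity_logits: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Score importance and sparsity in one pass and keep the top units.

    Assumed scheme: importance of each output unit is its L1 weight
    magnitude, the sparsity score is a learned sigmoid gate, and both are
    folded into a single score used to decide which units survive.
    """
    importance = weight.abs().sum(dim=1)         # (out_features,)
    sparsity = torch.sigmoid(sparsity_logits)    # (out_features,), learned gate
    score = importance * sparsity                # single-stage combined score
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.topk(score, k).indices
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[keep] = True
    return mask                                  # True = keep this unit

# Usage: mask = combined_prune_mask(torch.randn(128, 64), torch.randn(128), keep_ratio=0.25)
```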
arXiv Detail & Related papers (2024-03-23T13:22:36Z)
- Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
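A minimal sketch of matchability-guided sparse message passing in the spirit of the summary; the threshold gate, module names, and masking strategy are assumptions, not the released MaKeGNN model.

```python
import torch
import torch.nn as nn

class MatchabilityGuidedCrossAttention(nn.Module):
    """Hypothetical layer: each keypoint gets a predicted matchability score,
    and cross-image attention only aggregates messages from keypoints whose
    score clears a threshold, so non-repeatable keypoints are bypassed."""

    def __init__(self, dim: int, threshold: float = 0.5):
        super().__init__()
        self.matchability = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.threshold = threshold

    def forward(self, desc_a: torch.Tensor, desc_b: torch.Tensor) -> torch.Tensor:
        # desc_a: (B, Na, dim) keypoint descriptors of image A (queries)
        # desc_b: (B, Nb, dim) keypoint descriptors of image B (message sources)
        score_b = self.matchability(desc_b).squeeze(-1)   # (B, Nb)
        ignore = score_b < self.threshold                 # mask likely non-repeatable points
        msg, _ = self.attn(desc_a, desc_b, desc_b, key_padding_mask=ignore)
        return desc_a + msg                               # residual update of A's descriptors

# Usage: updated_a = MatchabilityGuidedCrossAttention(256)(torch.randn(1, 500, 256), torch.randn(1, 480, 256))
```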
arXiv Detail & Related papers (2023-07-04T02:50:44Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
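A minimal sketch of clustering key and value tokens before attention, as described above; the k-means routine, mean aggregation, and cluster count are illustrative choices, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, num_clusters: int = 64, iters: int = 3):
    """Group keys with a few k-means steps, mean-aggregate keys and values
    per cluster, then let the queries attend over the shorter clustered
    sequence. q, k, v: (B, N, dim)."""
    B, N, dim = k.shape
    # Initialise centroids from evenly spaced key tokens.
    idx = torch.linspace(0, N - 1, num_clusters, device=k.device).long()
    centroids = k[:, idx, :].clone()                               # (B, C, dim)
    for _ in range(iters):
        assign = torch.cdist(k, centroids).argmin(dim=-1)          # (B, N)
        one_hot = F.one_hot(assign, num_clusters).float()          # (B, N, C)
        counts = one_hot.sum(dim=1).clamp(min=1).unsqueeze(-1)     # (B, C, 1)
        centroids = one_hot.transpose(1, 2) @ k / counts           # cluster means of keys
    clustered_v = one_hot.transpose(1, 2) @ v / counts             # (B, C, dim)
    attn = (q @ centroids.transpose(-2, -1) * dim ** -0.5).softmax(dim=-1)
    return attn @ clustered_v                                      # (B, N, dim)

# Usage: out = clustered_attention(*[torch.randn(2, 1024, 64) for _ in range(3)])
```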
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
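A minimal sketch of key-only attention with a saliency gate, loosely following the summary; the global-context formulation and module names are assumptions, not the LinGlo implementation.

```python
import torch
import torch.nn as nn

class KeyOnlyAttention(nn.Module):
    """Hypothetical key-only attention: a small gate scores each key token's
    saliency, the scores are softmax-normalised into attention weights
    shared across positions, and no query-key dot product is computed."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_kv = nn.Linear(dim, dim * 2)
        self.saliency_gate = nn.Linear(dim, 1)   # one scalar score per token
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        weights = self.saliency_gate(k).softmax(dim=1)    # (B, N, 1), compute-light
        context = (weights * v).sum(dim=1, keepdim=True)  # global context (B, 1, dim)
        return self.proj(x + context)                     # broadcast context to all tokens

# Usage: out = KeyOnlyAttention(384)(torch.randn(2, 196, 384))
```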
arXiv Detail & Related papers (2022-07-01T03:36:49Z)
- Towards Data-Efficient Detection Transformers [77.43470797296906]
We show most detection transformers suffer from significant performance drops on small-size datasets.
We empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR.
We introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency.
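A toy sketch of one plausible reading of "label augmentation for richer supervision"; simply repeating ground-truth targets so more object queries receive positive assignments is an assumption here, not necessarily the paper's scheme.

```python
import torch

def repeat_labels(boxes: torch.Tensor, classes: torch.Tensor, repeats: int = 2):
    """Repeat each ground-truth box/class pair so the one-to-one matcher can
    assign positives to more object queries during training.

    boxes:   (num_gt, 4) ground-truth boxes
    classes: (num_gt,)   ground-truth class ids
    """
    aug_boxes = boxes.repeat(repeats, 1)     # (num_gt * repeats, 4)
    aug_classes = classes.repeat(repeats)    # (num_gt * repeats,)
    return aug_boxes, aug_classes

# Usage: b, c = repeat_labels(torch.rand(3, 4), torch.tensor([1, 7, 2]), repeats=2)
```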
arXiv Detail & Related papers (2022-03-17T17:56:34Z)
- Improving Contrastive Learning by Visualizing Feature Transformation [37.548120912055595]
In this paper, we attempt to devise a feature-level data manipulation, distinct from data augmentation, to enhance generic contrastive self-supervised learning.
We first design a visualization scheme for the pos/neg score distribution (the pos/neg score measures the similarity of a positive/negative pair), which enables us to analyze, interpret, and understand the learning process.
Experimental results show that the proposed Feature Transformation improves accuracy by at least 6.0% on ImageNet-100 over the MoCo baseline, and by about 2.0% on ImageNet-1K over the MoCoV2 baseline.
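A minimal sketch of feature-level manipulations in this spirit; extrapolating the positive pair and interpolating negatives is one illustrative recipe, and the exact sampling and normalisation here are assumptions rather than the paper's procedure.

```python
import torch
import torch.nn.functional as F

def transform_features(q: torch.Tensor, k_pos: torch.Tensor,
                       negatives: torch.Tensor, alpha: float = 0.5):
    """Make the positive pair harder by extrapolating the positive key away
    from the query, and synthesise extra negatives by interpolating pairs
    from the negative queue; everything is re-normalised afterwards.

    q, k_pos:  (B, dim) L2-normalised query/positive features
    negatives: (Q, dim) L2-normalised queue of negative features
    """
    lam = 1.0 + alpha * torch.rand(q.size(0), 1, device=q.device)   # lam > 1 -> extrapolation
    hard_pos = lam * k_pos + (1.0 - lam) * q
    perm = torch.randperm(negatives.size(0), device=negatives.device)
    mix = alpha * negatives + (1.0 - alpha) * negatives[perm]       # interpolated negatives
    new_negatives = torch.cat([negatives, mix], dim=0)
    return F.normalize(hard_pos, dim=-1), F.normalize(new_negatives, dim=-1)

# Usage: pos, negs = transform_features(torch.randn(32, 128), torch.randn(32, 128), torch.randn(4096, 128))
```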
arXiv Detail & Related papers (2021-08-06T07:26:08Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to improve the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
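A generic FGSM-style sketch of adversarial augmentation applied to intermediate feature embeddings rather than input pixels; the single-step sign update and the `head` classifier are illustrative assumptions, not the paper's procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adversarial_feature_step(features: torch.Tensor, head: nn.Module,
                             labels: torch.Tensor, epsilon: float = 0.1):
    """Back-propagate the classification loss to an intermediate feature map
    and perturb the features in the gradient's sign direction, producing a
    harder, augmented embedding for further training.

    features: (B, dim) intermediate embeddings from a backbone stage
    head:     classifier applied on top of the features
    """
    features = features.detach().requires_grad_(True)
    loss = F.cross_entropy(head(features), labels)
    grad, = torch.autograd.grad(loss, features)
    adv_features = features + epsilon * grad.sign()   # move features toward higher loss
    return adv_features.detach()

# Usage:
# head = nn.Linear(512, 10)
# adv = adversarial_feature_step(torch.randn(8, 512), head, torch.randint(0, 10, (8,)))
```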
arXiv Detail & Related papers (2021-03-22T20:36:34Z)