Advancing Vision Transformers with Group-Mix Attention
- URL: http://arxiv.org/abs/2311.15157v1
- Date: Sun, 26 Nov 2023 01:25:03 GMT
- Title: Advancing Vision Transformers with Group-Mix Attention
- Authors: Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing
Song, Ping Luo
- Abstract summary: Group-Mix Attention (GMA) is an advanced replacement for traditional self-attention.
GMA simultaneously captures token-to-token, token-to-group, and group-to-group correlations with various group sizes.
GroupMixFormer achieves state-of-the-art performance in image classification, object detection, and semantic segmentation.
- Score: 59.585623293856735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have been shown to enhance visual recognition
through modeling long-range dependencies with multi-head self-attention (MHSA),
which is typically formulated as Query-Key-Value computation. However, the
attention map generated from the Query and Key captures only token-to-token
correlations at a single granularity. In this paper, we argue that
self-attention should have a more comprehensive mechanism to capture
correlations among tokens and groups (i.e., multiple adjacent tokens) for
higher representational capacity. Therefore, we propose Group-Mix Attention (GMA)
as an advanced replacement for traditional self-attention, which can
simultaneously capture token-to-token, token-to-group, and group-to-group
correlations with various group sizes. To this end, GMA splits the Query, Key,
and Value into segments uniformly and performs different group aggregations to
generate group proxies. The attention map is computed based on the mixtures of
tokens and group proxies and used to re-combine the tokens and groups in Value.
Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which
achieves state-of-the-art performance in image classification, object
detection, and semantic segmentation with fewer parameters than existing
models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input)
attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while
GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.
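To make the mechanism concrete, below is a minimal, single-head PyTorch sketch of the Query/Key/Value splitting and group-proxy aggregation described above. The segment count (4), the use of depthwise 1-D convolutions over the flattened token sequence as group aggregators, and the kernel sizes (3, 5, 7) are illustrative assumptions, not the released GroupMixFormer implementation; the class name GroupMixAttentionSketch is hypothetical.

```python
import torch
import torch.nn as nn

class GroupMixAttentionSketch(nn.Module):
    """Minimal single-head sketch of Group-Mix Attention (GMA).

    Illustrative assumptions (not the released GroupMixFormer code): Q/K/V are
    split into 4 equal channel segments; the first segment keeps per-token
    granularity, and the other three are turned into group proxies by depthwise
    convolutions of kernel size 3, 5 and 7 over the token axis. The paper's
    backbone uses multi-head attention; one head is shown here for clarity.
    """

    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        assert dim % (len(kernel_sizes) + 1) == 0
        self.seg = dim // (len(kernel_sizes) + 1)          # channels per segment
        self.qkv = nn.Linear(dim, dim * 3)
        # One depthwise conv (group aggregator) per aggregated segment.
        self.aggregators = nn.ModuleList([
            nn.Conv1d(self.seg, self.seg, k, padding=k // 2, groups=self.seg)
            for k in kernel_sizes
        ])
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def _mix(self, x):
        # x: (B, N, C). Keep the first channel segment as individual tokens and
        # aggregate the remaining segments into group proxies of different sizes.
        segs = x.split(self.seg, dim=-1)
        mixed = [segs[0]]
        for seg, conv in zip(segs[1:], self.aggregators):
            mixed.append(conv(seg.transpose(1, 2)).transpose(1, 2))
        return torch.cat(mixed, dim=-1)                     # tokens + group proxies

    def forward(self, x):                                   # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = self._mix(q), self._mix(k), self._mix(v)
        # The Q.K^T dot product sums over token-level and group-proxy channels,
        # so the attention map captures token-to-token, token-to-group and
        # group-to-group correlations; it then re-combines the mixed Value.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, N, N)
        return self.proj(attn @ v)
```

For example (hypothetical sizes), GroupMixAttentionSketch(dim=64)(torch.randn(2, 196, 64)) returns a (2, 196, 64) tensor for a 14x14 token grid flattened into 196 tokens.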
Related papers
- GroupedMixer: An Entropy Model with Group-wise Token-Mixers for Learned Image Compression [64.47244912937204]
We propose a novel transformer-based entropy model called GroupedMixer.
GroupedMixer enjoys both faster coding speed and better compression performance than previous transformer-based methods.
Experimental results demonstrate that the proposed GroupedMixer yields the state-of-the-art rate-distortion performance with fast compression speed.
arXiv Detail & Related papers (2024-05-02T10:48:22Z)
- Towards Open-World Co-Salient Object Detection with Generative Uncertainty-aware Group Selective Exchange-Masking [23.60044777118441]
We introduce a group selective exchange-masking (GSEM) approach for enhancing the robustness of the CoSOD model.
GSEM selects a subset of images from each group using a novel learning-based strategy; the selected images are then exchanged between groups.
To simultaneously consider the uncertainty introduced by irrelevant images and the consensus features of the remaining relevant images in the group, we designed a latent variable generator branch and a CoSOD transformer branch.
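The exchange step can be illustrated with a toy sketch (random selection stands in for the paper's learned selection strategy, and the function name is hypothetical):

```python
import random

def group_selective_exchange(group_a, group_b, num_exchange=1):
    """Toy sketch: swap a few images across two CoSOD groups so that each
    group now contains some irrelevant images. Random selection is a
    stand-in for GSEM's learned selection strategy."""
    pick_a = random.sample(range(len(group_a)), num_exchange)
    pick_b = random.sample(range(len(group_b)), num_exchange)
    for i, j in zip(pick_a, pick_b):
        group_a[i], group_b[j] = group_b[j], group_a[i]
    return group_a, group_b
```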
arXiv Detail & Related papers (2023-10-16T10:40:40Z)
- ClusterFormer: Clustering As A Universal Visual Learner [80.79669078819562]
CLUSTERFORMER is a universal vision model based on the CLUSTERing paradigm with TransFORMER.
It is capable of tackling heterogeneous vision tasks with varying levels of clustering granularity.
Given its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
arXiv Detail & Related papers (2023-09-22T22:12:30Z)
- HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation [113.6560373226501]
This work studies semantic segmentation under the domain generalization setting.
We propose a novel hierarchical grouping transformer (HGFormer) to explicitly group pixels to form part-level masks and then whole-level masks.
Experiments show that HGFormer yields more robust semantic segmentation results than per-pixel classification methods and flat grouping transformers.
arXiv Detail & Related papers (2023-05-22T13:33:41Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
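As a rough illustration of this idea (plain k-means and mean pooling are assumptions for the sketch, not ClusTR's exact clustering scheme; the function name is hypothetical), keys and values can be clustered so that queries attend to far fewer entries:

```python
import torch
import torch.nn.functional as F

def clustered_kv_attention(q, k, v, num_clusters=64, iters=5):
    # q, k, v: (B, N, C) token features.
    B, N, C = k.shape
    # Initialise centroids from a random subset of the keys.
    centroids = k[:, torch.randperm(N)[:num_clusters], :].clone()        # (B, M, C)
    for _ in range(iters):
        assign = torch.cdist(k, centroids).argmin(dim=-1)                # (B, N)
        onehot = F.one_hot(assign, num_clusters).float()                 # (B, N, M)
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)            # (B, M, 1)
        centroids = onehot.transpose(1, 2) @ k / counts                  # mean key per cluster
    # Mean-pool the values with the same assignment.
    v_clustered = onehot.transpose(1, 2) @ v / counts                    # (B, M, C)
    # Queries attend to M cluster-level keys/values instead of N tokens.
    attn = torch.softmax((q @ centroids.transpose(1, 2)) * C ** -0.5, dim=-1)
    return attn @ v_clustered                                            # (B, N, C)
```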
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Group-CAM: Group Score-Weighted Visual Explanations for Deep Convolutional Networks [4.915848175689936]
We propose an efficient saliency map generation method, called Group score-weighted Class Activation Mapping (Group-CAM).
Group-CAM is efficient yet effective, requiring only dozens of queries to the network while producing target-related saliency maps.
arXiv Detail & Related papers (2021-03-25T14:16:02Z)
- CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
- Fast Transformers with Clustered Attention [14.448898156256478]
We propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids.
This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters.
We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget.
arXiv Detail & Related papers (2020-07-09T14:17:50Z)
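A hedged sketch of this idea (plain k-means stands in for the paper's clustering procedure, and the function name is hypothetical): queries are clustered, attention is computed once per centroid, and every query reuses its centroid's output, so the cost scales with the number of clusters rather than the number of queries.

```python
import torch
import torch.nn.functional as F

def clustered_query_attention(q, k, v, num_clusters=32, iters=5):
    # q: (B, Nq, C) queries; k, v: (B, Nk, C) keys/values.
    B, Nq, C = q.shape
    centroids = q[:, torch.randperm(Nq)[:num_clusters], :].clone()       # (B, M, C)
    for _ in range(iters):
        assign = torch.cdist(q, centroids).argmin(dim=-1)                # (B, Nq)
        onehot = F.one_hot(assign, num_clusters).float()                 # (B, Nq, M)
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)            # (B, M, 1)
        centroids = onehot.transpose(1, 2) @ q / counts                  # mean query per cluster
    # Full attention is computed only for the M centroids, not the Nq queries.
    attn = torch.softmax((centroids @ k.transpose(1, 2)) * C ** -0.5, dim=-1)  # (B, M, Nk)
    out_centroids = attn @ v                                             # (B, M, C)
    # Each query reuses the output of the centroid it was assigned to.
    return torch.gather(out_centroids, 1, assign.unsqueeze(-1).expand(B, Nq, C))
```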