Efficient Visual Transformer by Learnable Token Merging
- URL: http://arxiv.org/abs/2407.15219v1
- Date: Sun, 21 Jul 2024 17:09:19 GMT
- Title: Efficient Visual Transformer by Learnable Token Merging
- Authors: Yancheng Wang, Yingzhen Yang,
- Abstract summary: We propose a novel transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer.
LTM-Transformer is compatible with many popular and compact transformer networks.
It renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers.
- Score: 8.905020033545643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT-S/16, and Swin-T, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by reduction of Information Bottleneck, and a novel and separable variational upper bound for the IB loss is derived. The architecture of mask module in our LTM blocks which generate the token merging mask is designed to reduce the derived upper bound for the IB loss. Extensive results on computer vision tasks evidence that LTM-Transformer renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers. The code of the LTM-Transformer is available at \url{https://github.com/Statistical-Deep-Learning/LTM}.
Related papers
- MABViT -- Modified Attention Block Enhances Vision Transformers [0.0]
We propose a novel transformer variant that integrates non-linearity within the attention block to tackle this problem.
We implement the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset.
arXiv Detail & Related papers (2023-12-03T09:00:31Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - A K-variate Time Series Is Worth K Words: Evolution of the Vanilla
Transformer Architecture for Long-term Multivariate Time Series Forecasting [52.33042819442005]
Transformer has become the de facto solution for MTSF, especially for the long-term cases.
In this study, we point out that the current tokenization strategy in MTSF Transformer architectures ignores the token inductive bias of Transformers.
We make a series of evolution on the basic architecture of the vanilla MTSF transformer.
Surprisingly, the evolved simple transformer architecture is highly effective, which successfully avoids the over-smoothing phenomena in the vanilla MTSF transformer.
arXiv Detail & Related papers (2022-12-06T07:00:31Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT)
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed as LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose a novel scalable Transformers, which naturally contains sub-Transformers of different scales and have shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new textbfConvolution-enhanced image Transformer (CeiT) which combines the advantages of CNNs in extracting low-level features, strengthening locality, and the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z) - Rewiring the Transformer with Depth-Wise LSTMs [55.50278212605607]
We present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers.
Experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task.
arXiv Detail & Related papers (2020-07-13T09:19:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.