Sliced Recursive Transformer
- URL: http://arxiv.org/abs/2111.05297v1
- Date: Tue, 9 Nov 2021 17:59:14 GMT
- Title: Sliced Recursive Transformer
- Authors: Zhiqiang Shen and Zechun Liu and Eric Xing
- Abstract summary: Recursive operation on vision transformers can improve parameter utilization without involving additional parameters.
Our model Sliced Recursive Transformer (SReT) is compatible with a broad range of other designs for efficient vision transformers.
- Score: 23.899076070924153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a neat yet effective recursive operation on vision transformers
that can improve parameter utilization without involving additional parameters.
This is achieved by sharing weights across the depth of the transformer
network. The proposed method obtains a substantial gain (~2%) simply by using
a naïve recursive operation, requires no special or sophisticated knowledge of
network design principles, and introduces minimal computational overhead to
the training procedure. To reduce the additional computation caused by the
recursive operation while maintaining superior accuracy, we propose an
approximation through multiple sliced group self-attentions across recursive
layers, which reduces the computational cost by 10~30% with minimal
performance loss. We call our model Sliced Recursive Transformer (SReT), which
is compatible with a broad range of other designs for efficient vision
transformers. Our best model achieves a significant improvement on ImageNet
over state-of-the-art methods while containing fewer parameters. The proposed
sliced recursive operation allows us to effortlessly build a transformer with
more than 100 or even 1000 layers while keeping the model size small (13~15M
parameters), avoiding optimization difficulties when the model grows too
large. This flexible scalability shows great potential for scaling up and
constructing extremely deep and high-dimensional vision transformers. Our code
and models are
available at https://github.com/szq0214/SReT.
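To make the two ideas above concrete, the following is a minimal PyTorch sketch, not the authors' released implementation (see the repository above for the real code). The module and parameter names (SlicedGroupAttention, RecursiveBlock, num_groups, depth) are hypothetical; the sketch assumes "sliced group self-attention" means computing attention within non-overlapping slices of the token sequence, and "recursive operation" means reusing a single block's weights several times.
```python
# Hypothetical sketch (not the authors' code; see the GitHub repo above for
# the real implementation). It illustrates two ideas from the abstract:
#  1) recursive weight sharing: one transformer block is applied several times,
#  2) sliced group self-attention: tokens are split into groups so attention
#     is computed within each group, reducing the quadratic cost.
import torch
import torch.nn as nn


class SlicedGroupAttention(nn.Module):
    """Multi-head self-attention applied independently to token groups."""

    def __init__(self, dim, num_heads=8, num_groups=2):
        super().__init__()
        self.num_groups = num_groups
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, C)
        groups = x.chunk(self.num_groups, dim=1)   # slice the token sequence
        out = [self.attn(g, g, g, need_weights=False)[0] for g in groups]
        return torch.cat(out, dim=1)


class RecursiveBlock(nn.Module):
    """A single block whose weights are reused across `depth` recursive steps."""

    def __init__(self, dim, depth=2, num_heads=8, num_groups=2, mlp_ratio=4):
        super().__init__()
        self.depth = depth
        self.norm1 = nn.LayerNorm(dim)
        self.attn = SlicedGroupAttention(dim, num_heads, num_groups)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # The same parameters are applied `depth` times, so the effective
        # network depth grows without adding parameters.
        for _ in range(self.depth):
            x = x + self.attn(self.norm1(x))
            x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 384)          # (batch, tokens, embed dim)
    block = RecursiveBlock(dim=384, depth=3)   # 3 recursive reuses of one block
    print(block(tokens).shape)                 # torch.Size([2, 196, 384])
```
The actual SReT design stacks such components across the whole network and chooses how many groups each recursive stage uses; those details are not reproduced in this sketch.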
Related papers
- Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [38.30350849992281]
"Recursive" language models share parameters across layers with minimal loss of performance.
Recursive Transformers are efficiently initialized from standard pretrained Transformers, but use only a single block of unique layers that is then repeated multiple times in a loop.
We show that our models outperform both similar-sized vanilla pretrained models and knowledge distillation baselines.
arXiv Detail & Related papers (2024-10-28T02:15:45Z)
- Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers [14.756988176469365]
An effective approach to reduce computational requirements and increase efficiency is to prune unnecessary components of Deep Neural Networks.
Previous work has shown that attribution methods from the field of eXplainable AI serve as effective means to extract and prune the least relevant network components in a few-shot fashion.
arXiv Detail & Related papers (2024-08-22T17:35:18Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer [34.46696132157042]
We introduce a set of simple yet effective TOken REduction strategies for Transformer-based Human Mesh Recovery from monocular images.
We propose token reduction strategies based on two important aspects, i.e., the 3D geometric structure and 2D image features.
Our method massively reduces the number of tokens involved in high-complexity interactions in the Transformer.
arXiv Detail & Related papers (2022-11-19T14:06:58Z)
- DRT: A Lightweight Single Image Deraining Recursive Transformer [21.889582347604648]
Over-parameterization is a common technique in deep learning to help models learn and generalize sufficiently to the given task.
Recent powerful transformer-based deep learning models on vision tasks usually have heavy parameters and bear training difficulty.
We introduce a self-attention structure with residual connections and propose the Deraining Recursive Transformer (DRT).
Our proposed model uses only 1.3% of the number of parameters of the current best performing model in deraining while exceeding the state-of-the-art methods on the Rain100L benchmark by at least 0.33 dB.
arXiv Detail & Related papers (2022-04-25T01:06:09Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers.
We are the first to achieve fully quantized Vision Transformers with comparable accuracy (only 1% degradation).
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
- HRFormer: High-Resolution Transformer for Dense Prediction [99.6060997466614]
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks.
We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet).
We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks.
arXiv Detail & Related papers (2021-10-18T15:37:58Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)