Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks
- URL: http://arxiv.org/abs/2204.07780v1
- Date: Sat, 16 Apr 2022 11:30:26 GMT
- Title: Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks
- Authors: Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu,
Feiyue Huang, Rongrong Ji
- Abstract summary: We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
- Score: 126.33843752332139
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its exciting performance, the Transformer is criticized for its
excessive parameters and computation cost. However, compressing the Transformer
remains an open problem due to the internal complexity of its layer designs, i.e.,
Multi-Head Attention (MHA) and the Feed-Forward Network (FFN). To address this
issue, we introduce Group-wise Transformation towards a universal yet lightweight
Transformer for vision-and-language tasks, termed LW-Transformer. LW-Transformer
applies Group-wise Transformation to reduce both the parameters and computations of
the Transformer, while preserving its two main properties, i.e., the efficient
attention modeling on diverse subspaces of MHA and the expanding-scaling feature
transformation of FFN. We apply LW-Transformer to a set of Transformer-based
networks and quantitatively measure them on three vision-and-language tasks and six
benchmark datasets. Experimental results show that, while saving a large number of
parameters and computations, LW-Transformer achieves very competitive performance
against the original Transformer networks for vision-and-language tasks. To examine
its generalization ability, we also apply our optimization strategy to the recently
proposed Swin Transformer for image classification, where its effectiveness can
also be confirmed.
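To make the group-wise idea concrete, the following is a minimal PyTorch sketch of
an FFN whose expanding-scaling projections are applied per channel group, cutting
the projection parameters roughly by the number of groups; the class name, sizes,
and grouping scheme are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class GroupwiseFFN(nn.Module):
    """Feed-forward block whose expand/contract projections act per channel group,
    reducing parameters by roughly a factor of `groups` versus a standard FFN."""

    def __init__(self, d_model: int = 512, expansion: int = 4, groups: int = 4):
        super().__init__()
        assert d_model % groups == 0
        self.groups = groups
        d_g = d_model // groups
        # One small expand/contract pair per group instead of one large pair.
        self.expand = nn.ModuleList(
            [nn.Linear(d_g, d_g * expansion) for _ in range(groups)])
        self.contract = nn.ModuleList(
            [nn.Linear(d_g * expansion, d_g) for _ in range(groups)])
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); split the channel dimension into groups.
        chunks = x.chunk(self.groups, dim=-1)
        out = [c(self.act(e(chunk)))
               for e, c, chunk in zip(self.expand, self.contract, chunks)]
        return torch.cat(out, dim=-1)


if __name__ == "__main__":
    ffn = GroupwiseFFN()
    print(ffn(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```

The same grouping could in principle be applied to the query/key/value and output
projections of MHA, which is the other half of the reduction described in the
abstract.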
Related papers
- Efficient Visual Transformer by Learnable Token Merging [8.905020033545643]
We propose a novel transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer.
LTM-Transformer is compatible with many popular and compact transformer networks.
It renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers.
arXiv Detail & Related papers (2024-07-21T17:09:19Z)
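As background intuition for token merging in general, here is a minimal sketch of a
learnable merging layer that halves the token count with a learned linear map over
adjacent token pairs; it is a generic, assumed illustration and not the LTM block
proposed in the paper above.

```python
import torch
import torch.nn as nn


class LearnableTokenMerge(nn.Module):
    """Halve the sequence length by merging each adjacent token pair through a
    learned linear map from 2*d_model to d_model."""

    def __init__(self, d_model: int = 384):
        super().__init__()
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        assert n % 2 == 0, "pad the sequence to an even length first"
        pairs = x.reshape(b, n // 2, 2 * d)  # concatenate neighbouring tokens
        return self.merge(pairs)             # (b, n // 2, d)


if __name__ == "__main__":
    merge = LearnableTokenMerge()
    tokens = torch.randn(1, 196, 384)        # e.g. 14x14 ViT patch tokens
    print(merge(tokens).shape)               # torch.Size([1, 98, 384])
```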
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
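A simplified way to picture the quadrangle idea is a small regression head that
predicts a per-window transform from pooled window features and applies it to the
default window corners. The sketch below uses a plain affine transform and assumed
tensor shapes; the paper's quadrangle parameterisation and its integration into
attention are richer than this.

```python
import torch
import torch.nn as nn


class WindowQuadrangleRegressor(nn.Module):
    """Predict a 2x3 affine transform per window from pooled window features and
    apply it to that window's default corner coordinates."""

    def __init__(self, dim: int = 96):
        super().__init__()
        self.head = nn.Linear(dim, 6)
        # Start from the identity transform so training begins with plain windows.
        nn.init.zeros_(self.head.weight)
        with torch.no_grad():
            self.head.bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, win_feats: torch.Tensor, corners: torch.Tensor) -> torch.Tensor:
        # win_feats: (num_windows, tokens_per_window, dim)
        # corners:   (num_windows, 4, 2), default corners in normalised coordinates
        theta = self.head(win_feats.mean(dim=1)).view(-1, 2, 3)
        ones = torch.ones(corners.shape[:-1] + (1,), device=corners.device)
        homo = torch.cat([corners, ones], dim=-1)          # (num_windows, 4, 3)
        return torch.einsum('wij,wkj->wki', theta, homo)   # transformed corners


if __name__ == "__main__":
    reg = WindowQuadrangleRegressor()
    feats = torch.randn(8, 49, 96)   # 8 windows of 7x7 tokens
    base = torch.tensor([[0., 0.], [1., 0.], [1., 1.], [0., 1.]]).expand(8, 4, 2)
    print(reg(feats, base).shape)    # torch.Size([8, 4, 2])
```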
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
Swin Transformer set a new record in various vision tasks by using hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- The Nuts and Bolts of Adopting Transformer in GANs [124.30856952272913]
We investigate the properties of Transformer in the generative adversarial network (GAN) framework for high-fidelity image synthesis.
Our study leads to a new alternative design of Transformers in GANs: a convolutional neural network (CNN)-free generator termed STrans-G.
arXiv Detail & Related papers (2021-10-25T17:01:29Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales that share parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
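The parameter-sharing idea behind such scalable models can be pictured with a toy
linear layer whose narrower sub-layers reuse a slice of the full weight matrix; this
is an assumed, simplified sketch, not the paper's sub-Transformer construction or
its three-stage training scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScalableLinear(nn.Module):
    """Linear layer whose narrower sub-layers reuse the leading slice of the full
    weight, so sub-models of different widths share parameters with the full model."""

    def __init__(self, d_in: int = 512, d_out: int = 512):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        # Keep only the leading fraction of output units and match the input width.
        d_out = int(self.weight.shape[0] * width)
        w = self.weight[:d_out, :x.shape[-1]]
        return F.linear(x, w, self.bias[:d_out])


if __name__ == "__main__":
    layer = ScalableLinear()
    full = layer(torch.randn(2, 512))             # full-width model
    sub = layer(torch.randn(2, 256), width=0.5)   # half-width sub-model, shared weights
    print(full.shape, sub.shape)                  # torch.Size([2, 512]) torch.Size([2, 256])
```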
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.