PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift
- URL: http://arxiv.org/abs/2304.03481v1
- Date: Fri, 7 Apr 2023 05:21:37 GMT
- Title: PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift
- Authors: Gaojie Wu, Wei-Shi Zheng, Yutong Lu, Qi Tian
- Abstract summary: Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependency.
We propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone.
- Score: 139.17852337764586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has shown great potential for various visual tasks
due to its ability to model long-range dependencies. However, ViT requires a
large amount of computing resources to compute global self-attention. In this
work, we propose a ladder self-attention block with multiple branches and a
progressive shift mechanism to develop a light-weight transformer backbone
that requires fewer computing resources (e.g., a relatively small number of
parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT).
First, the ladder self-attention block reduces the computational cost by
modelling local self-attention in each branch. Meanwhile, the progressive
shift mechanism enlarges the receptive field of the ladder self-attention
block by modelling diverse local self-attention for each branch and by letting
the branches interact. Second, the input feature of the ladder self-attention
block is split equally along the channel dimension across the branches, which
considerably reduces the computational cost of the block (to nearly 1/3 of the
parameters and FLOPs), and the outputs of the branches are then combined by a
pixel-adaptive fusion module. Therefore, the ladder self-attention block with
a relatively small number of parameters and FLOPs is capable of modelling
long-range interactions. Based on the ladder self-attention block, PSLT
performs well on several vision tasks, including image classification, object
detection and person re-identification. On the ImageNet-1k dataset, PSLT
achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which
is comparable to several existing models with more than 20M parameters and 4G
FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
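Because the input channels are split equally across the branches, each branch's attention projections act on only C/3 of the channels, so the per-branch projection cost scales with (C/3)^2 and the sum over three branches comes to roughly 1/3 of a single full-width block, which is where the "nearly 1/3" figure above comes from. Below is a minimal PyTorch sketch of the idea, assuming three branches, 7x7 local windows, a roll-based progressive shift and a 1x1-convolution pixel-adaptive fusion; these are illustrative simplifications, not the authors' implementation (see the official code linked above).

```python
# Minimal sketch of a ladder self-attention block in the spirit of PSLT.
# Everything below (window size, shift offsets, the way branches interact,
# and the pixel-adaptive fusion) is an illustrative assumption, not the
# authors' implementation; see https://isee-ai.cn/wugaojie/PSLT.html.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WindowAttention(nn.Module):
    """Local self-attention inside non-overlapping windows (single head,
    no border masking, for brevity)."""

    def __init__(self, dim, window=7):
        super().__init__()
        self.window, self.scale = window, dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, shift=0):
        # x: (B, C, H, W); a non-zero shift rolls the map so this branch
        # attends over differently positioned local windows.
        B, C, H, W = x.shape
        if shift:
            x = torch.roll(x, shifts=(-shift, -shift), dims=(2, 3))
        w = self.window
        x = F.pad(x, (0, (w - W % w) % w, 0, (w - H % w) % w))
        Hp, Wp = x.shape[2], x.shape[3]
        # partition into (B * num_windows, w * w, C) token groups
        x = x.view(B, C, Hp // w, w, Wp // w, w).permute(0, 2, 4, 3, 5, 1)
        x = x.reshape(-1, w * w, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        x = self.proj(attn.softmax(dim=-1) @ v)
        # reverse the window partition, padding and shift
        x = x.view(B, Hp // w, Wp // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        x = x.reshape(B, C, Hp, Wp)[:, :, :H, :W]
        if shift:
            x = torch.roll(x, shifts=(shift, shift), dims=(2, 3))
        return x


class LadderSelfAttention(nn.Module):
    """Channels are split equally across branches; each branch runs local
    attention with a progressively larger shift, earlier branch outputs are
    fed to later branches ("ladder"), and a per-pixel softmax fuses them."""

    def __init__(self, dim, branches=3, window=7):
        super().__init__()
        assert dim % branches == 0
        self.branches, dim_b = branches, dim // branches
        self.attn = nn.ModuleList([WindowAttention(dim_b, window) for _ in range(branches)])
        self.fuse_w = nn.Conv2d(dim, branches, 1)   # per-pixel branch weights
        self.fuse_proj = nn.Conv2d(dim_b, dim, 1)   # back to the full width

    def forward(self, x):
        chunks = x.chunk(self.branches, dim=1)      # equal channel split
        outs, prev = [], 0
        for i, (xi, attn) in enumerate(zip(chunks, self.attn)):
            # progressive shift: branch i shifts its windows by 2 * i pixels
            # and receives the previous branch's output so branches interact
            prev = attn(xi + prev, shift=2 * i)
            outs.append(prev)
        weights = self.fuse_w(torch.cat(outs, dim=1)).softmax(dim=1)  # (B, branches, H, W)
        fused = sum(weights[:, i:i + 1] * outs[i] for i in range(self.branches))
        return self.fuse_proj(fused)


if __name__ == "__main__":
    block = LadderSelfAttention(dim=96, branches=3, window=7)
    print(block(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```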
Related papers
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios (a rough sketch of such a block-diagonal mixer is given after this list).
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation [3.9548535445908928]
We propose DAE-Former, a novel method that seeks to provide an alternative perspective by efficiently designing the self-attention mechanism.
Our method outperforms state-of-the-art methods on multi-organ cardiac and skin lesion segmentation datasets without requiring pre-training weights.
arXiv Detail & Related papers (2022-12-27T14:39:39Z) - Lite Vision Transformer with Enhanced Self-Attention [39.32480787105232]
We propose Lite Vision Transformer (LVT), a novel light-weight vision transformer network with two enhanced self-attention mechanisms.
For the low-level features, we introduce Convolutional Self-Attention (CSA).
For the high-level features, we propose Recursive Atrous Self-Attention (RASA).
arXiv Detail & Related papers (2021-12-20T19:11:53Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
The approach also shows a consistent efficiency gain on the recent transformer-based image recognition model ViT.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z) - ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks [4.143032261649983]
"Ultra-Lightweight Subspace Attention Mechanism" (ULSAM) is end-to-end trainable and can be deployed as a plug-and-play module in compact convolutional neural networks (CNNs)
We achieve $approx$13% and $approx$25% reduction in both the FLOPs and parameter counts of MobileNet-V2 with a 0.27% and more than 1% improvement in top-1 accuracy on the ImageNet-1K and fine-grained image classification datasets (respectively)
arXiv Detail & Related papers (2020-06-26T17:05:43Z)
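The SCHEME entry above mentions replacing the FFN's dense connections with a diagonal block structure so that larger expansion ratios become affordable. The sketch below illustrates that idea with grouped 1x1 convolutions, which realize a block-diagonal weight matrix; the group count and expansion ratio used here are arbitrary assumptions rather than values from that paper.

```python
# Illustrative block-diagonal channel mixer in the spirit of the SCHEME
# summary above: grouped 1x1 convolutions give a block-diagonal weight
# matrix, so a larger expansion ratio costs roughly 1/groups of the dense
# FFN's parameters and FLOPs. Group count and expansion ratio are assumed.
import torch
import torch.nn as nn


class BlockDiagonalFFN(nn.Module):
    def __init__(self, dim, expansion=8, groups=4):
        super().__init__()
        self.fc1 = nn.Conv2d(dim, dim * expansion, 1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(dim * expansion, dim, 1, groups=groups)

    def forward(self, x):                      # x: (B, C, H, W)
        return self.fc2(self.act(self.fc1(x)))


if __name__ == "__main__":
    dense = nn.Sequential(nn.Conv2d(96, 768, 1), nn.GELU(), nn.Conv2d(768, 96, 1))
    blockdiag = BlockDiagonalFFN(96, expansion=8, groups=4)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(dense), count(blockdiag))      # grouped mixer uses ~1/4 of the weights
    print(blockdiag(torch.randn(1, 96, 14, 14)).shape)
```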