Related papers: ViT-LSLA: Vision Transformer with Light Self-Limited-Attention

URL: http://arxiv.org/abs/2210.17115v1
Date: Mon, 31 Oct 2022 07:46:45 GMT
Title: ViT-LSLA: Vision Transformer with Light Self-Limited-Attention
Authors: Zhenzhe Hechen, Wei Huang, Yixin Zhao
Abstract summary: This paper presents a light self-limited-attention (LSLA) consisting of a light self-attention mechanism (LSA) to save the computation cost and the number of parameters, and a self-limited-attention mechanism (SLA) to improve the performance. Experiments show that ViT-LSLA achieves 71.6% top-1 accuracy on IP102 and 87.2% top-1 accuracy on Mini-ImageNet.
Score: 4.903718320156974
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers have demonstrated competitive performance across a wide range of vision tasks, but global self-attention is very expensive to compute. Many methods limit the range of attention to a local window to reduce computational complexity. However, these approaches do not reduce the number of parameters; meanwhile, the self-attention and inner position bias (inside the softmax function) cause each query to focus on similar and close patches. Consequently, this paper presents a light self-limited-attention (LSLA) consisting of a light self-attention mechanism (LSA) to save the computation cost and the number of parameters, and a self-limited-attention mechanism (SLA) to improve the performance. Firstly, the LSA replaces the K (Key) and V (Value) of self-attention with X (the original input). Applying it in vision Transformers, which have an encoder architecture and a self-attention mechanism, can simplify the computation. Secondly, the SLA has a positional information module and a limited-attention module. The former contains a dynamic scale and an inner position bias to adjust the distribution of the self-attention scores and enhance the positional information. The latter uses an outer position bias after the softmax function to limit some large values of attention weights. Finally, a hierarchical Vision Transformer with Light self-Limited-attention (ViT-LSLA) is presented. The experiments show that ViT-LSLA achieves 71.6% top-1 accuracy on IP102 (a 2.4% absolute improvement over Swin-T) and 87.2% top-1 accuracy on Mini-ImageNet (a 3.7% absolute improvement over Swin-T). Furthermore, it greatly reduces FLOPs (3.5 GFLOPs vs. 4.5 GFLOPs for Swin-T) and parameters (18.9M vs. 27.6M for Swin-T).
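
The mechanism described in the abstract can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation. The abstract only states that K and V are replaced by X, that a dynamic scale and an inner position bias act on the scores, and that an outer position bias is applied after the softmax. Everything else here is an assumption made for illustration: the function and parameter names are hypothetical, the dynamic scale is modelled as an element-wise matrix, and the outer bias is simply added to the attention weights. Windowing, multiple heads, and the output projection are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def light_self_limited_attention(x, w_q, scale, inner_bias, outer_bias):
    """Hypothetical single-head LSLA sketch.

    x:          n tokens, each a list of d features
    w_q:        d x d query projection (the only projection that is kept)
    scale:      n x n element-wise scale (assumed form of the dynamic scale)
    inner_bias: n x n bias added before the softmax
    outer_bias: n x n bias added after the softmax (assumed to be additive)
    """
    n = len(x)
    q = matmul(x, w_q)                 # only the query is projected
    scores = matmul(q, transpose(x))   # X stands in for K
    scores = [[scores[i][j] * scale[i][j] + inner_bias[i][j] for j in range(n)]
              for i in range(n)]
    weights = softmax_rows(scores)
    weights = [[weights[i][j] + outer_bias[i][j] for j in range(n)]
               for i in range(n)]
    return matmul(weights, x), weights  # X stands in for V


# Toy input: three tokens with two features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [[1.0] * 3 for _ in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]

output, attn = light_self_limited_attention(tokens, identity, ones, zeros, zeros)
```

With the scale set to ones and both biases set to zero, the sketch reduces to an unscaled softmax(Q Xᵀ) X. Because the key and value projections are dropped, the only learned weight matrix left in this sketch is `w_q`, which is consistent with the parameter savings the abstract reports.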

Related papers

Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models [80.50996301430108]
Sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks. We propose a one-stage method named SNELLA to overcome the above limitations. SNELLA achieves SOTA performance with low memory usage.
arXiv Detail & Related papers (2025-10-28T03:39:18Z)
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [88.47701139980636]
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. We propose SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
arXiv Detail & Related papers (2025-09-28T17:58:59Z)
Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors [7.630782404476683]
We introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of the power delay profile (PDP). We also propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy.
arXiv Detail & Related papers (2025-01-14T01:16:30Z)
LASER: Attention with Exponential Transformation [20.1832156343096]
We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. We introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER Attention can be implemented by making small modifications to existing attention implementations.
arXiv Detail & Related papers (2024-11-05T20:18:28Z)
StableMask: Refining Causal Masking in Decoder-only Transformer [22.75632485195928]
The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling. However, it requires all attention scores to be non-zero and sum up to 1, even if the current embedding has sufficient self-contained information. We propose StableMask: a parameter-free method to address both limitations by refining the causal mask.
arXiv Detail & Related papers (2024-02-07T12:01:02Z)
SeTformer is What You Need for Vision and Language [26.036537788653373]
Self-optimal Transport (SeT) is a novel transformer for achieving better performance and computational efficiency. SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K. SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark.
arXiv Detail & Related papers (2024-01-07T16:52:49Z)
PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift [139.17852337764586]
Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependency. We propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone.
arXiv Detail & Related papers (2023-04-07T05:21:37Z)
A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions. With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks. They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers. We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. Existing methods are either theoretically flawed or empirically ineffective for visual recognition. We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity. Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention [28.44439386445018]
We propose a Pale-Shaped self-Attention, which performs self-attention within a pale-shaped region. Compared to the global self-attention, PS-Attention can reduce the computation and memory costs significantly. We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with the model size of 22M, 48M, and 85M respectively.
arXiv Detail & Related papers (2021-12-28T05:37:24Z)
Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks. We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer. The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.