ViT-LSLA: Vision Transformer with Light Self-Limited-Attention
- URL: http://arxiv.org/abs/2210.17115v1
- Date: Mon, 31 Oct 2022 07:46:45 GMT
- Title: ViT-LSLA: Vision Transformer with Light Self-Limited-Attention
- Authors: Zhenzhe Hechen, Wei Huang, Yixin Zhao
- Abstract summary: This paper presents a light self-limited-attention (LSLA) consisting of a light self-attention mechanism (LSA) to save the computation cost and the number of parameters, and a self-limited-attention mechanism (SLA) to improve the performance.
Experiments show that ViT-LSLA achieves 71.6% top-1 accuracy on IP102 and 87.2% top-1 accuracy on Mini-ImageNet.
- Score: 4.903718320156974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have demonstrated competitive performance across a wide range
of vision tasks, but global self-attention is very expensive to compute.
Many methods limit the range of attention within a local window
to reduce computational complexity. However, these approaches do not reduce the
number of parameters; meanwhile, the self-attention and inner position bias
(inside the softmax function) cause each query to focus on similar and close
patches. Consequently, this paper presents a light self-limited-attention
(LSLA) consisting of a light self-attention mechanism (LSA) to save the
computation cost and the number of parameters, and a self-limited-attention
mechanism (SLA) to improve the performance. Firstly, the LSA replaces the K
(Key) and V (Value) of self-attention with X (the original input). Applying it in
vision Transformers, which have an encoder architecture and a self-attention
mechanism, can simplify the computation. Secondly, the SLA has a positional
information module and a limited-attention module. The former contains a
dynamic scale and an inner position bias to adjust the distribution of the
self-attention scores and enhance the positional information. The latter uses
an outer position bias after the softmax function to limit some large values of
attention weights. Finally, a hierarchical Vision Transformer with Light
self-Limited-attention (ViT-LSLA) is presented. The experiments show that
ViT-LSLA achieves 71.6% top-1 accuracy on IP102 (2.4% absolute improvement over
Swin-T) and 87.2% top-1 accuracy on Mini-ImageNet (3.7% absolute improvement over
Swin-T). Furthermore, it greatly reduces FLOPs (3.5 GFLOPs vs. 4.5 GFLOPs for
Swin-T) and parameters (18.9M vs. 27.6M for Swin-T).
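The two mechanisms described in the abstract can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the scalar `scale` (standing in for the dynamic scale), the additive form of `outer_bias`, and all function and variable names are assumptions made for illustration only.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def light_self_limited_attention(x, w_q, scale, inner_bias, outer_bias):
    """Single-head sketch for one window of n tokens (hypothetical helper).

    x:          (n, d) input tokens, reused as both key and value (LSA).
    w_q:        (d, d) query projection, the only projection kept here.
    scale:      scalar standing in for the dynamic scale (assumed form).
    inner_bias: (n, n) position bias added inside the softmax.
    outer_bias: (n, n) position bias added after the softmax (assumed additive).
    """
    q = x @ w_q                                # only Q is projected
    scores = scale * (q @ x.T) + inner_bias    # K is replaced by X
    weights = softmax(scores) + outer_bias     # outer bias applied after softmax
    return weights @ x                         # V is replaced by X


rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
w_q = rng.standard_normal((d, d)) / np.sqrt(d)
inner_bias = np.zeros((n, n))
outer_bias = np.zeros((n, n))
out = light_self_limited_attention(x, w_q, 1.0 / np.sqrt(d), inner_bias, outer_bias)
print(out.shape)  # (4, 8)
```

Because K and V are taken directly from the input, this sketch needs no key or value projection matrices, which is where the parameter savings described in the abstract would come from.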
Related papers
- Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models [80.50996301430108]
Sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks. We propose a one-stage method named SNELLA to overcome the above limitations. SNELLA achieves SOTA performance with low memory usage.
arXiv Detail & Related papers (2025-10-28T03:39:18Z) - SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [88.47701139980636]
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. We propose SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
arXiv Detail & Related papers (2025-09-28T17:58:59Z) - Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors [7.630782404476683]
We introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of power delay profile (PDP).
We also propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy.
arXiv Detail & Related papers (2025-01-14T01:16:30Z) - LASER: Attention with Exponential Transformation [20.1832156343096]
We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small.
We introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal.
We show that LASER Attention can be implemented by making small modifications to existing attention implementations.
arXiv Detail & Related papers (2024-11-05T20:18:28Z) - StableMask: Refining Causal Masking in Decoder-only Transformer [22.75632485195928]
The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling.
However, it requires all attention scores to be non-zero and sum up to 1, even if the current embedding has sufficient self-contained information.
We propose StableMask: a parameter-free method to address both limitations by refining the causal mask.
arXiv Detail & Related papers (2024-02-07T12:01:02Z) - SeTformer is What You Need for Vision and Language [26.036537788653373]
Self-optimal Transport (SeT) is a novel transformer for achieving better performance and computational efficiency.
SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K.
SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark.
arXiv Detail & Related papers (2024-01-07T16:52:49Z) - PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift [139.17852337764586]
Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependency.
We propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone.
arXiv Detail & Related papers (2023-04-07T05:21:37Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention [28.44439386445018]
We propose a Pale-Shaped self-Attention, which performs self-attention within a pale-shaped region.
Compared to the global self-attention, PS-Attention can reduce the computation and memory costs significantly.
We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with the model size of 22M, 48M, and 85M respectively.
arXiv Detail & Related papers (2021-12-28T05:37:24Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)