X-ViT: High Performance Linear Vision Transformer without Softmax
- URL: http://arxiv.org/abs/2205.13805v1
- Date: Fri, 27 May 2022 07:47:22 GMT
- Title: X-ViT: High Performance Linear Vision Transformer without Softmax
- Authors: Jeonggeun Song, Heung-Chang Lee
- Abstract summary: Vision transformers have become one of the most important models for computer vision tasks.
They require heavy computational resources on a scale that is quadratic in the number of tokens, $N$.
Here, we propose the X-ViT, ViT with a novel SA mechanism that has linear complexity.
- Score: 1.6244541005112747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have become one of the most important models for computer
vision tasks. Although they outperform prior works, they require heavy
computational resources on a scale that is quadratic in the number of tokens,
$N$. This is a major drawback of the traditional self-attention (SA) algorithm.
Here, we propose the X-ViT, ViT with a novel SA mechanism that has linear
complexity. The main approach of this work is to eliminate nonlinearity from
the original SA. We factorize the matrix multiplication of the SA mechanism
without complicated linear approximation. By modifying only a few lines of code
from the original SA, the proposed models outperform most transformer-based
models on image classification and dense prediction tasks across most capacity
regimes.
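To make the claim about "modifying only a few lines of code" concrete, below is a minimal PyTorch sketch of the general softmax-free linear attention idea. It is not X-ViT's exact formulation: the L2 normalization of queries and keys is an assumed placeholder for whatever normalization the paper actually applies, and the function names are illustrative. The point is that once the row-wise softmax is removed, $(QK^\top)V$ can be re-associated as $Q(K^\top V)$, which costs $O(Nd^2)$ instead of $O(N^2d)$.

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v):
    """Vanilla softmax attention: O(N^2 * d) time, O(N^2) memory per head."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # (B, H, N, N)
    return attn @ v

def linear_attention_no_softmax(q, k, v):
    """Softmax-free sketch: without the row-wise softmax the product can be
    re-associated as Q @ (K^T V), which is O(N * d^2), i.e. linear in N.
    The L2 normalization here is only a stand-in; X-ViT's actual
    normalization scheme may differ."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    context = k.transpose(-2, -1) @ v   # (B, H, d, d), no N x N matrix is formed
    return q @ context                  # (B, H, N, d)

# Toy shapes: batch 2, 4 heads, 196 tokens, head dim 64.
q = torch.randn(2, 4, 196, 64)
k = torch.randn(2, 4, 196, 64)
v = torch.randn(2, 4, 196, 64)
print(standard_attention(q, k, v).shape)         # torch.Size([2, 4, 196, 64])
print(linear_attention_no_softmax(q, k, v).shape)
```

Because the $N \times N$ attention matrix is never materialized, memory also grows linearly with the token count, which is what makes this family of methods attractive for dense prediction at high resolution.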
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy (a minimal illustration of this idea is sketched after this list).
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing linear (softmax-free) attention methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
- UFO-ViT: High Performance Linear Vision Transformer without Softmax [0.0]
We propose UFO-ViT (Unit Force Operated Vision Transformer), a novel method to reduce the computations of self-attention by eliminating some non-linearity.
The model outperforms most transformer-based models on image classification and dense prediction tasks across most capacity regimes.
arXiv Detail & Related papers (2021-09-29T12:32:49Z)
- Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT on image classification while consuming fewer computing resources.
arXiv Detail & Related papers (2021-07-05T19:24:23Z)
- THG: Transformer with Hyperbolic Geometry [8.895324519034057]
"X-former" models make changes only around the quadratic time and memory complexity of self-attention.
We propose a novel Transformer with Hyperbolic Geometry (THG) model, which takes advantage of both Euclidean and hyperbolic space.
arXiv Detail & Related papers (2021-06-01T14:09:33Z)
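As a companion to the PriViT entry above, the sketch below illustrates what "Taylorizing" a nonlinearity can look like: an nn.GELU is swapped for its second-order Taylor expansion around zero, which is polynomial and therefore cheap under secure multi-party computation. This is only a hedged illustration of the general idea; PriViT itself selects which nonlinearities to replace (and how) via optimization, and the helper names here are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class TaylorGELU(nn.Module):
    """Degree-2 Taylor expansion of GELU around 0:
    GELU(x) = x * Phi(x) ~= 0.5*x + (1/sqrt(2*pi)) * x**2.
    Polynomials avoid the non-polynomial ops that make secure multi-party
    inference expensive; PriViT's learned, selective replacement strategy
    is more involved than this fixed substitution."""
    def forward(self, x):
        return 0.5 * x + 0.3989422804 * x * x

def taylorize_gelus(model: nn.Module) -> int:
    """Swap every nn.GELU in a model for the polynomial version.
    Returns the number of modules replaced. Purely illustrative."""
    replaced = 0
    for name, child in model.named_children():
        if isinstance(child, nn.GELU):
            setattr(model, name, TaylorGELU())
            replaced += 1
        else:
            replaced += taylorize_gelus(child)
    return replaced

# Example on a small MLP block like the ones inside a ViT encoder layer.
mlp = nn.Sequential(nn.Linear(192, 768), nn.GELU(), nn.Linear(768, 192))
print(taylorize_gelus(mlp))                  # 1
print(mlp(torch.randn(4, 197, 192)).shape)   # torch.Size([4, 197, 192])
```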
This list is automatically generated from the titles and abstracts of the papers on this site.