UFO-ViT: High Performance Linear Vision Transformer without Softmax
- URL: http://arxiv.org/abs/2109.14382v1
- Date: Wed, 29 Sep 2021 12:32:49 GMT
- Title: UFO-ViT: High Performance Linear Vision Transformer without Softmax
- Authors: Jeong-geun Song
- Abstract summary: We propose UFO-ViT (Unit Force Operated Vision Transformer), a novel method to reduce the computation of self-attention by eliminating some of its non-linearity.
The proposed models outperform most transformer-based models on image classification and dense prediction tasks across most capacity regimes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have become one of the most important models for computer
vision tasks. While they outperform earlier convolutional networks, the
quadratic complexity in the number of tokens $N$ is one of the major drawbacks
of traditional self-attention algorithms. Here we propose UFO-ViT (Unit Force
Operated Vision Transformer), a novel method to reduce the computation of
self-attention by eliminating some of its non-linearity. By modifying a few
lines of self-attention, UFO-ViT achieves linear complexity without degrading
performance. The proposed models outperform most transformer-based models on
image classification and dense prediction tasks across most capacity regimes.
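For intuition, here is a minimal PyTorch sketch of the idea described in the abstract: drop softmax, multiply $K^\top V$ first so the cost is linear in $N$, and stabilize both factors with an L2-style cross-normalization ("XNorm"). The normalization axes, the learnable per-head scale `gamma`, and the module layout are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxFreeAttention(nn.Module):
    """Softmax-free self-attention with linear complexity in N (sketch)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head scale for the XNorm-style normalization (assumed form).
        self.gamma = nn.Parameter(torch.ones(heads))

    def xnorm(self, x: torch.Tensor) -> torch.Tensor:
        # Project each vector onto the unit hypersphere, then rescale per head.
        return self.gamma.view(1, -1, 1, 1) * F.normalize(x, dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, heads, N, d) with d = C // heads.
        q, k, v = (t.view(B, N, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        kv = k.transpose(-2, -1) @ v            # (B, heads, d, d): linear in N
        out = self.xnorm(q) @ self.xnorm(kv)    # (B, heads, N, d), no softmax
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```

Because $K^\top V$ is only $d \times d$, time and memory scale as $O(N d^2)$ rather than $O(N^2 d)$; normalizing both factors takes over the stabilizing role that softmax plays in standard attention.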
Related papers
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution [6.367865391518726]
Transformer-based models have achieved remarkable results in low-level vision tasks, including image super-resolution (SR).
To activate more input pixels globally, hybrid attention models have been proposed.
We employ wavelet losses to train Transformer models to improve quantitative and subjective performance.
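One plausible form of such a loss is sketched below under stated assumptions (a single-level Haar transform with unweighted L1 per subband; the paper's wavelet family, depth, and weighting may differ). It penalizes high-frequency detail explicitly instead of letting it average away in a pixel-wise loss:

```python
import torch
import torch.nn.functional as F

def haar_dwt(x: torch.Tensor):
    """Single-level 2D Haar transform of a (B, C, H, W) batch.
    Returns four subbands, each (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    return ((a + b + c + d) / 2,   # local average (approximation)
            (a + b - c - d) / 2,   # difference between rows
            (a - b + c - d) / 2,   # difference between columns
            (a - b - c + d) / 2)   # diagonal detail

def wavelet_l1_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """L1 distance accumulated over the Haar subbands of output and target."""
    return sum(F.l1_loss(s, h) for s, h in zip(haar_dwt(sr), haar_dwt(hr)))
```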
arXiv Detail & Related papers (2024-04-17T11:25:19Z) - PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
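To see why "Taylorizing" helps: secure multi-party computation handles additions and multiplications natively, so a low-degree polynomial is far cheaper than an exact GELU or softmax. The sketch below shows an illustrative degree-2 Taylor surrogate for GELU around 0; PriViT's actual contribution, learning where such replacements preserve accuracy, is not shown here.

```python
import torch

def gelu_taylor(x: torch.Tensor) -> torch.Tensor:
    """Degree-2 Taylor expansion of GELU(x) = x * Phi(x) around 0.
    Phi(x) ~ 0.5 + x / sqrt(2*pi) near 0, so GELU(x) ~ 0.5*x + x**2 / sqrt(2*pi)."""
    return 0.5 * x + x.pow(2) / (2 * torch.pi) ** 0.5

x = torch.linspace(-1.0, 1.0, 5)
print(torch.nn.functional.gelu(x))  # exact nonlinearity
print(gelu_taylor(x))               # polynomial surrogate, accurate near 0
```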
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
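The mechanism shared by these approaches is associativity: with a nonnegative feature map $\phi$, $\mathrm{softmax}(QK^\top)V$ is approximated by $\phi(Q)(\phi(K)^\top V)$, and the $d \times d$ matrix $\phi(K)^\top V$ is computed first. A generic sketch follows; the simple ReLU feature map is a stand-in, not the focused mapping proposed in this paper.

```python
import torch

def linear_attention(q, k, v, feature_map=lambda t: torch.relu(t) + 1e-6):
    """Kernelized linear attention over (N, d) tensors.
    Computes phi(Q) @ (phi(K)^T @ V), normalized by phi(Q) @ phi(K)^T @ 1,
    costing O(N * d^2) instead of the O(N^2 * d) of softmax attention."""
    q, k = feature_map(q), feature_map(k)                   # nonnegative scores
    kv = k.transpose(-2, -1) @ v                            # (d, d), linear in N
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # (N, 1) normalizer
    return (q @ kv) / z

q = k = v = torch.randn(196, 64)        # e.g. 14x14 patch tokens, one 64-dim head
print(linear_attention(q, k, v).shape)  # torch.Size([196, 64])
```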
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training, changes to the model architecture, or modified loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z) - Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, a mechanism that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - X-ViT: High Performance Linear Vision Transformer without Softmax [1.6244541005112747]
Vision transformers have become one of the most important models for computer vision tasks.
They require heavy computational resources on a scale that is quadratic in the number of tokens, $N$.
Here, we propose X-ViT, a ViT with a novel self-attention (SA) mechanism that has linear complexity.
arXiv Detail & Related papers (2022-05-27T07:47:22Z) - ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z) - Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT in image classification while consuming fewer computing resources.
arXiv Detail & Related papers (2021-07-05T19:24:23Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)