Softmax-free Linear Transformers
- URL: http://arxiv.org/abs/2207.03341v3
- Date: Fri, 15 Mar 2024 00:52:40 GMT
- Title: Softmax-free Linear Transformers
- Authors: Jiachen Lu, Junge Zhang, Xiatian Zhu, Jianfeng Feng, Tao Xiang, Li Zhang,
- Abstract summary: Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
- Score: 90.83157268265654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the development of approximating the self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximations, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. As preserving the softmax operation challenges any subsequent linearization efforts. By this insight, a family of Softmax-Free Transformers (SOFT) are proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling a full self-attention matrix to be approximated under low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, much longer token sequences are permitted by SOFT, resulting in superior trade-off between accuracy and complexity. Code and models are available at https://github.com/fudan-zvg/SOFT.
Related papers
- Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Convex Bounds on the Softmax Function with Applications to Robustness
Verification [69.09991317119679]
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well.
This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models.
arXiv Detail & Related papers (2023-03-03T05:07:02Z) - Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm
Regularization [44.54772242784423]
We develop an efficient non regularization algorithm for low-rank learning matrix.
The proposed algorithm can avoid expensive folding/unfolding problems.
Experiments show that the proposed algorithm is efficient and more space than the existing state-of-the-world.
arXiv Detail & Related papers (2022-05-06T07:47:10Z) - cosFormer: Rethinking Softmax in Attention [60.557869510885205]
kernel methods are often adopted to reduce the complexity by approximating the softmax operator.
Due to the approximation errors, their performances vary in different tasks/corpus and suffer crucial performance drops.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z) - Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root.
For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad'e Approximants (MPA)
A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
arXiv Detail & Related papers (2022-01-29T10:00:35Z) - FC2T2: The Fast Continuous Convolutional Taylor Transform with
Applications in Vision and Graphics [8.629912408966145]
We revisit the Taylor series expansion from a modern Machine Learning perspective.
We introduce the Fast Continuous Convolutional Taylor Transform (FC2T2), a variant of the Fast Multipole Method (FMM), that allows for the efficient approximation of low dimensional convolutional operators in continuous space.
arXiv Detail & Related papers (2021-10-29T22:58:42Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.