FLatten Transformer: Vision Transformer using Focused Linear Attention
- URL: http://arxiv.org/abs/2308.00442v2
- Date: Fri, 1 Sep 2023 08:01:36 GMT
- Title: FLatten Transformer: Vision Transformer using Focused Linear Attention
- Authors: Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang
- Abstract summary: Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
- Score: 80.61335173752146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quadratic computation complexity of self-attention has been a persistent
challenge when applying Transformer models to vision tasks. Linear attention,
on the other hand, offers a much more efficient alternative with its linear
complexity by approximating the Softmax operation through carefully designed
mapping functions. However, current linear attention approaches either suffer
from significant performance degradation or introduce additional computation
overhead from the mapping functions. In this paper, we propose a novel Focused
Linear Attention module to achieve both high efficiency and expressiveness.
Specifically, we first analyze the factors contributing to the performance
degradation of linear attention from two perspectives: the focus ability and
feature diversity. To overcome these limitations, we introduce a simple yet
effective mapping function and an efficient rank restoration module to enhance
the expressiveness of self-attention while maintaining low computation
complexity. Extensive experiments show that our linear attention module is
applicable to a variety of advanced vision Transformers, and achieves
consistently improved performances on multiple benchmarks. Code is available at
https://github.com/LeapLabTHU/FLatten-Transformer.
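As background for the mechanism the abstract describes: linear attention replaces the Softmax with a kernel feature map applied to queries and keys, so the key-value contraction can be computed first and the cost drops from quadratic to linear in the number of tokens. The sketch below is a minimal, generic NumPy illustration of this reordering with a simple ReLU feature map; it is not the paper's focused mapping function or rank restoration module (see the linked repository for those), and the function names and shapes are illustrative.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard Softmax attention: O(N^2 d) in sequence length N."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Generic kernel-based linear attention.

    softmax(QK^T)V is approximated by phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1).
    Contracting phi(K)^T with V first gives O(N d^2) cost instead of O(N^2 d).
    """
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                      # (d, d) key-value summary
    z = Qp @ Kp.sum(axis=0)            # (N,) normalizer
    return (Qp @ kv) / z[:, None]

# Toy usage with illustrative sizes: N tokens, d channels.
N, d = 196, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)        # (N, d), linear in N
```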
Related papers
- You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention [23.874485033096917]
Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications.
We propose a first-of-its-kind algorithm-hardware co-designed framework, dubbed ViTALiTy, for boosting the inference efficiency of ViTs.
ViTALiTy unifies both low-rank and sparse components of the attention in ViTs.
arXiv Detail & Related papers (2022-11-09T18:58:21Z) - The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z) - Linear Video Transformer with Feature Fixation [34.324346469406926]
Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism.
We propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
We achieve state-of-the-art performance among linear video Transformers on three popular video classification benchmarks.
arXiv Detail & Related papers (2022-10-15T02:20:50Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to handling the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias to vision transformers while keeping linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution.
We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction.
Our overall method significantly outperforms state-of-the-art methods by more than 1 dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
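The last entry above rests on the observation that relative positional encoding yields a Toeplitz structure, so the relevant products can be evaluated with the FFT. As a minimal, generic sketch of that underlying trick (not that paper's full kernelized-attention algorithm), the code below multiplies a Toeplitz matrix by a vector in O(n log n) via circulant embedding; the function name and test values are illustrative.

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, x):
    """Multiply a Toeplitz matrix T by a vector x in O(n log n).

    T is defined by its first column and first row (first_col[0] == first_row[0]).
    T is embedded in a 2n x 2n circulant matrix; circulant matrices are
    diagonalized by the DFT, so the product reduces to an FFT, an elementwise
    multiplication, and an inverse FFT.
    """
    n = len(x)
    # First column of the circulant embedding: [t_0..t_{n-1}, 0, t_{-(n-1)}..t_{-1}].
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_pad))
    return y[:n].real

# Check against a dense Toeplitz matrix built explicitly (illustrative values).
n = 8
col = np.random.randn(n)
row = np.concatenate([[col[0]], np.random.randn(n - 1)])
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
x = np.random.randn(n)
assert np.allclose(T @ x, toeplitz_matvec_fft(col, row, x))
```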