Linear Video Transformer with Feature Fixation
- URL: http://arxiv.org/abs/2210.08164v1
- Date: Sat, 15 Oct 2022 02:20:50 GMT
- Title: Linear Video Transformer with Feature Fixation
- Authors: Kaiyue Lu, Zexiang Liu, Jianyuan Wang, Weixuan Sun, Zhen Qin, Dong Li,
Xuyang Shen, Hui Deng, Xiaodong Han, Yuchao Dai, Yiran Zhong
- Abstract summary: Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism.
We propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
We achieve state-of-the-art performance among linear video Transformers on three popular video classification benchmarks.
- Score: 34.324346469406926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers have achieved impressive performance in video
classification, while suffering from the quadratic complexity caused by the
Softmax attention mechanism. Some studies alleviate the computational costs by
reducing the number of tokens in attention calculation, but the complexity is
still quadratic. Another promising way is to replace Softmax attention with
linear attention, which has linear complexity but shows a clear performance
drop. We find that such a drop in linear attention results from the lack of
attention concentration on critical features. Therefore, we propose a feature
fixation module to reweight the feature importance of the query and key before
computing linear attention. Specifically, we regard the query, key, and value
as various latent representations of the input token, and learn the feature
fixation ratio by aggregating Query-Key-Value information. This is beneficial
for measuring the feature importance comprehensively. Furthermore, we enhance
the feature fixation by neighborhood association, which leverages additional
guidance from spatial and temporal neighbouring tokens. The proposed method
significantly improves the linear attention baseline and achieves
state-of-the-art performance among linear video Transformers on three popular
video classification benchmarks. With fewer parameters and higher efficiency,
our performance is even comparable to some Softmax-based quadratic
Transformers.
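The idea in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: the ReLU feature map, the gate built from a sigmoid over a linear map of Q + K + V, and the names `feature_map`, `fixation_ratio`, and `fixated_linear_attention` are all assumptions made for illustration, and the neighborhood association step is omitted. It shows why linear attention avoids the n x n attention matrix, and where a per-feature reweighting of the query and key could be applied before the attention is computed.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def feature_map(x):
    """Non-negative feature map (assumption: ReLU plus a small epsilon)."""
    return [[max(v, 0.0) + 1e-6 for v in row] for row in x]


def linear_attention(q, k, v):
    """Kernelised attention in O(n * d^2): phi(K)^T V is computed once and reused."""
    q, k = feature_map(q), feature_map(k)
    kv = matmul(transpose(k), v)           # shape (d, d_v), no n x n matrix
    k_sum = [sum(col) for col in zip(*k)]  # shape (d,), for the row normaliser
    out = []
    for q_row in q:
        norm = sum(a * b for a, b in zip(q_row, k_sum))
        num = matmul([q_row], kv)[0]
        out.append([x / norm for x in num])
    return out


def quadratic_reference(q, k, v):
    """Same attention via the explicit n x n matrix, for checking only."""
    q, k = feature_map(q), feature_map(k)
    scores = matmul(q, transpose(k))
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


def fixation_ratio(q, k, v, w):
    """Hypothetical gate in (0, 1): sigmoid of a linear map of Q + K + V.

    Assumes Q, K, and V share the same shape (n, d) and w has shape (d, d).
    """
    agg = [[a + b + c for a, b, c in zip(qr, kr, vr)] for qr, kr, vr in zip(q, k, v)]
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in matmul(agg, w)]


def fixated_linear_attention(q, k, v, w):
    """Reweight query and key features with the gate, then apply linear attention."""
    gate = fixation_ratio(q, k, v, w)
    q_fix = [[a * g for a, g in zip(qr, gr)] for qr, gr in zip(q, gate)]
    k_fix = [[a * g for a, g in zip(kr, gr)] for kr, gr in zip(k, gate)]
    return linear_attention(q_fix, k_fix, v)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 6, 4  # 6 tokens, 4 features per token
    q, k, v = (random_matrix(n, d, rng) for _ in range(3))
    w = random_matrix(d, d, rng)  # stands in for a learned projection
    for row in fixated_linear_attention(q, k, v, w):
        print([round(x, 3) for x in row])
```

Because the gate is strictly positive, multiplying by it before the ReLU feature map scales each feature's contribution without changing its sign, which is one simple way to concentrate attention on selected features while keeping the cost linear in the number of tokens.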
Related papers
- Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient solution by reducing the complexity to linear levels.
Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map.
We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
arXiv Detail & Related papers (2024-11-12T08:30:59Z)
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Efficient Linear Attention for Fast and Accurate Keypoint Matching [0.9699586426043882]
Recently Transformers have provided state-of-the-art performance in sparse matching, crucial to realize high-performance 3D vision applications.
Yet, these Transformers lack efficiency due to the quadratic computational complexity of their attention mechanism.
We propose a new attentional aggregation that achieves high accuracy by aggregating both the global and local information from sparse keypoints.
arXiv Detail & Related papers (2022-04-16T06:17:36Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce the complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can drop sharply.
We propose a linear transformer called cosFormer that can achieve accuracy comparable to or better than the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z)