Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
- URL: http://arxiv.org/abs/2406.19391v3
- Date: Fri, 20 Dec 2024 02:12:06 GMT
- Title: Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
- Authors: Ali Khaleghi Rahimian, Manish Kumar Govind, Subhajit Maity, Dominick Reilly, Christian Kümmerle, Srijan Das, Aritra Dutta
- Abstract summary: We propose a new multi-head self-attention (MHSA) variant named Fibottention, which can replace MHSA in Transformer architectures.
Fibottention is data-efficient and computationally more suitable for processing large numbers of tokens than the standard MHSA.
It employs structured sparse attention based on dilated Fibonacci sequences, which, uniquely, differ across attention heads, resulting in inception-like diverse features across heads.
- Score: 10.169639612525643
- License:
- Abstract: Transformer architectures such as Vision Transformers (ViT) have proven effective for solving visual perception tasks. However, they suffer from two major limitations: first, the quadratic complexity of self-attention limits the number of tokens that can be processed, and second, Transformers often require large amounts of training data to attain state-of-the-art performance. In this paper, we propose a new multi-head self-attention (MHSA) variant named Fibottention, which can replace MHSA in Transformer architectures. Fibottention is data-efficient and computationally more suitable for processing large numbers of tokens than the standard MHSA. It employs structured sparse attention based on dilated Fibonacci sequences, which, uniquely, differ across attention heads, resulting in inception-like diverse features across heads. The spacing of the Fibonacci sequences follows the Wythoff array, which minimizes the redundancy of token interactions aggregated across different attention heads, while still capturing sufficient complementary information through token pair interactions. These sparse attention patterns are unique among existing sparse attention mechanisms and lead to an $O(N \log N)$ complexity, where $N$ is the number of tokens. Leveraging only 2-6% of the elements in the self-attention heads, Fibottention embedded into popular, state-of-the-art Transformer architectures achieves significantly improved predictive performance in domains with limited data, such as image classification, video understanding, and robot learning, while reducing computational complexity. We further validate the improved diversity of feature representations arising from the different self-attention heads, as well as our model design against other sparse attention mechanisms.
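As a rough illustration of the sparsity pattern described in the abstract, the sketch below builds one boolean attention mask per head from a head-specific, Fibonacci-like offset sequence. It is a minimal simplification: the seeds `(h + 1, h + 2)` stand in for the paper's exact Wythoff-array construction, and the function names are illustrative rather than taken from the authors' code.

```python
import torch

def fibonacci_offsets(a, b, max_offset):
    """Fibonacci-like sequence of attention offsets seeded with (a, b)."""
    offsets = []
    while a <= max_offset:
        offsets.append(a)
        a, b = b, a + b
    return offsets

def fibottention_masks(num_tokens, num_heads):
    """Per-head boolean masks for structured sparse attention.

    Simplified stand-in for the paper's Wythoff-array construction: head h
    uses seeds (h + 1, h + 2), and every token attends to itself and to
    tokens at +/- each offset in its head-specific sequence.
    """
    masks = torch.zeros(num_heads, num_tokens, num_tokens, dtype=torch.bool)
    idx = torch.arange(num_tokens)
    masks[:, idx, idx] = True                      # keep the diagonal in every head
    for h in range(num_heads):
        for off in fibonacci_offsets(h + 1, h + 2, num_tokens - 1):
            rows = idx[: num_tokens - off]
            masks[h, rows, rows + off] = True      # attend `off` tokens ahead
            masks[h, rows + off, rows] = True      # and `off` tokens behind
    return masks

# Example: 197 tokens (a ViT-B/16 on 224x224 images, incl. CLS), 12 heads.
masks = fibottention_masks(num_tokens=197, num_heads=12)
print(masks.float().mean(dim=(1, 2)))  # fraction of score entries kept per head
```

Such masks would typically be applied to the attention logits before the softmax, e.g. `scores.masked_fill(~mask, float("-inf"))`, so that each head aggregates a different, sparse subset of token-pair interactions.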
Related papers
- Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi [0.0]
Convolutional neural networks (CNNs) evaluate short-range correlations in input images, which progress along the layers.
Vision Transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers.
This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism.
arXiv Detail & Related papers (2025-01-22T14:19:48Z)
- Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction [29.12836710966048]
We propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens.
Our results call into question the conventional wisdom that pairwise similarity style attention mechanisms are critical to the success of transformer architectures.
arXiv Detail & Related papers (2024-12-23T18:59:21Z)
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
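One way to read the DAPE V2 summary above in code: treat the per-head attention scores as an H-channel feature map and pass them through a small convolution before the softmax. The class name and the depthwise residual design are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConvRefinedAttention(nn.Module):
    """Refine (B, H, N, N) attention scores with a depthwise 3x3 convolution,
    treating the H heads as channels of a feature map."""

    def __init__(self, num_heads):
        super().__init__()
        self.refine = nn.Conv2d(num_heads, num_heads, kernel_size=3,
                                padding=1, groups=num_heads)

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, tokens, head_dim)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, H, N, N)
        scores = scores + self.refine(scores)                   # residual conv on the score map
        return torch.softmax(scores, dim=-1) @ v

# Example shapes: batch 2, 8 heads, 64 tokens, head dim 32.
out = ConvRefinedAttention(num_heads=8)(*(torch.randn(2, 8, 64, 32) for _ in range(3)))
```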
- Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers [13.480259378415505]
BiXT scales linearly with input size in terms of computational cost and memory consumption.
BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module.
By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences.
arXiv Detail & Related papers (2024-02-19T13:38:15Z)
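A hedged sketch of the bi-directional cross-attention idea summarized above: a single latent-token similarity matrix is normalized along each axis and reused in both directions, so cost scales with latents × tokens rather than tokens². Linear projections, multiple heads, and the exact BiXT module layout are omitted; names are illustrative.

```python
import torch

def bidirectional_cross_attention(latents, tokens):
    # latents: (B, M, D), tokens: (B, N, D); projections omitted for brevity
    sim = torch.einsum("bmd,bnd->bmn", latents, tokens) / latents.shape[-1] ** 0.5
    latents_out = torch.softmax(sim, dim=-1) @ tokens                    # latents read from tokens
    tokens_out = torch.softmax(sim, dim=-2).transpose(-2, -1) @ latents  # tokens read from latents
    return latents_out, tokens_out

lat, tok = torch.randn(2, 16, 64), torch.randn(2, 1024, 64)
lat_out, tok_out = bidirectional_cross_attention(lat, tok)  # (2, 16, 64), (2, 1024, 64)
```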
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
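To make the hybrid-scale idea from the shunted self-attention summary above concrete, here is a simplified sketch in which each head sees keys and values average-pooled along the token axis at its own rate. The real SSA aggregates tokens with learned, spatially-aware downsampling; the per-head pooling here is only an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def hybrid_scale_attention(q, k, v, pool_rates):
    """q, k, v: (B, H, N, D); pool_rates: one int per head (1 = full resolution)."""
    outputs = []
    scale = q.shape[-1] ** -0.5
    for h, r in enumerate(pool_rates):
        kh, vh = k[:, h], v[:, h]                                     # (B, N, D)
        if r > 1:
            kh = F.avg_pool1d(kh.transpose(1, 2), r).transpose(1, 2)  # (B, N // r, D)
            vh = F.avg_pool1d(vh.transpose(1, 2), r).transpose(1, 2)
        attn = torch.softmax(q[:, h] @ kh.transpose(-2, -1) * scale, dim=-1)
        outputs.append(attn @ vh)                                     # (B, N, D)
    return torch.stack(outputs, dim=1)                                # (B, H, N, D)

q = k = v = torch.randn(2, 4, 64, 32)
out = hybrid_scale_attention(q, k, v, pool_rates=[1, 2, 4, 8])
```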
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
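A minimal sketch of mixing tokens in the Fourier domain, per the AFNO summary above: FFT along the token axis, a learned per-frequency complex weighting, inverse FFT. The full AFNO applies further structure to the modes; this keeps only the core pattern, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class FourierTokenMixer(nn.Module):
    def __init__(self, num_tokens, dim):
        super().__init__()
        # one complex weight per retained frequency and channel, stored as (real, imag)
        self.weight = nn.Parameter(torch.randn(num_tokens // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):                          # x: (B, N, D), real-valued
        modes = torch.fft.rfft(x, dim=1)           # (B, N // 2 + 1, D), complex
        modes = modes * torch.view_as_complex(self.weight)
        return torch.fft.irfft(modes, n=x.shape[1], dim=1)  # back to (B, N, D)

mixer = FourierTokenMixer(num_tokens=196, dim=384)
y = mixer(torch.randn(2, 196, 384))                # token mixing in O(N log N)
```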
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
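The "transposed" attention described above can be sketched as follows: the softmax runs over a D × D cross-covariance of features rather than an N × N token similarity, so cost grows linearly in the number of tokens. The normalization shown is a simplification (XCiT also uses a learnable temperature), and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v):
    """q, k, v: (B, H, N, D); the attention map is (B, H, D, D), independent of N."""
    qn = F.normalize(q, dim=-2)                               # L2-normalize each channel over tokens
    kn = F.normalize(k, dim=-2)
    attn = torch.softmax(qn.transpose(-2, -1) @ kn, dim=-1)   # (B, H, D, D) channel-mixing map
    return (attn @ v.transpose(-2, -1)).transpose(-2, -1)     # (B, H, N, D)

q = k = v = torch.randn(2, 8, 1024, 48)
out = cross_covariance_attention(q, k, v)                     # linear in the 1024 tokens
```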
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.