Associative Transformer
- URL: http://arxiv.org/abs/2309.12862v3
- Date: Wed, 31 Jan 2024 01:05:14 GMT
- Title: Associative Transformer
- Authors: Yuwei Sun, Hideya Ochiai, Zhirong Wu, Stephen Lin, Ryota Kanai
- Abstract summary: We propose Associative Transformer (AiT) to enhance the association among sparsely attended input patches.
AiT requires significantly fewer parameters and attention layers while outperforming Vision Transformers and a broad range of sparse Transformers.
- Score: 26.967506484952214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emerging from the pairwise attention in conventional Transformers, there is a
growing interest in sparse attention mechanisms that align more closely with
localized, contextual learning in the biological brain. Existing studies such
as the Coordination method employ iterative cross-attention mechanisms with a
bottleneck to enable the sparse association of inputs. However, these methods
are parameter inefficient and fail in more complex relational reasoning tasks.
To this end, we propose Associative Transformer (AiT) to enhance the
association among sparsely attended input patches, improving parameter
efficiency and performance in relational reasoning tasks. AiT leverages a
learnable explicit memory, comprised of various specialized priors, with a
bottleneck attention to facilitate the extraction of diverse localized
features. Moreover, we propose a novel associative memory-enabled patch
reconstruction with a Hopfield energy function. The extensive experiments in
four image classification tasks with three different sizes of AiT demonstrate
that AiT requires significantly fewer parameters and attention layers while
outperforming Vision Transformers and a broad range of sparse Transformers.
Additionally, AiT establishes new SOTA performance in the Sort-of-CLEVR
dataset, outperforming the previous Coordination method.
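The abstract names two ingredients: a bottleneck attention driven by learnable memory priors, and patch reconstruction with a Hopfield energy function. The sketch below is a generic illustration of those two ideas and is not the authors' implementation. The relevance rule (summing each prior's softmax attention over patches), the use of the standard modern Hopfield update, and every name in it (bottleneck_select, hopfield_retrieve, hopfield_energy, the toy priors and patches, beta) are assumptions made for illustration. The priors are fixed by hand here, whereas the abstract describes them as learnable.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bottleneck_select(patches, priors, k):
    """Return indices of the k patches that receive the most attention.

    Assumption: each prior attends over all patches, and a patch's
    relevance is the sum of the attention it receives from every prior.
    """
    relevance = [0.0] * len(patches)
    for prior in priors:
        weights = softmax([dot(prior, p) for p in patches])
        for i, w in enumerate(weights):
            relevance[i] += w
    ranked = sorted(range(len(patches)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])

def hopfield_energy(query, priors, beta):
    """Modern Hopfield energy, with additive constants dropped."""
    scores = [beta * dot(m, query) for m in priors]
    top = max(scores)
    lse = top + math.log(sum(math.exp(s - top) for s in scores))
    return -lse / beta + 0.5 * dot(query, query)

def hopfield_retrieve(query, priors, beta):
    """One update step: a softmax-weighted average of the stored priors."""
    weights = softmax([beta * dot(m, query) for m in priors])
    return [sum(w * m[j] for w, m in zip(weights, priors))
            for j in range(len(query))]

# Toy data (hypothetical): two priors and four 3-dimensional patches.
priors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patches = [[0.9, 0.1, 0.0], [0.0, 0.0, 0.2], [0.1, 0.8, 0.1], [0.05, 0.05, 0.0]]

selected = bottleneck_select(patches, priors, k=2)
noisy = patches[selected[0]]
restored = hopfield_retrieve(noisy, priors, beta=8.0)
energy_before = hopfield_energy(noisy, priors, beta=8.0)
energy_after = hopfield_energy(restored, priors, beta=8.0)
```

In this toy setup the bottleneck keeps the two patches that align with a prior, and the retrieval step pulls the first selected patch toward its nearest prior while lowering the energy. A larger beta makes retrieval sharper, moving the result closer to a single stored prior.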
Related papers
- Scaled and Inter-token Relation Enhanced Transformer for Sample-restricted Residential NILM [0.0]
We propose two novel mechanisms to enhance the attention mechanism of the original transformer to improve performance.
The first mechanism reduces the prioritization of intra-token relationships in the token similarity matrix during training, thereby increasing inter-token focus.
The second mechanism introduces a learnable temperature tuning for the token similarity matrix, mitigating the over-smoothing problem associated with fixed temperature values.
arXiv Detail & Related papers (2024-10-12T18:58:45Z)
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
arXiv Detail & Related papers (2023-11-20T17:35:44Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies a curious phenomenon in machine learning models with Transformer architectures: their activation maps are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Knowledge Amalgamation for Object Detection with Transformers [36.7897364648987]
Knowledge amalgamation (KA) is a novel deep model reusing task aiming to transfer knowledge from several well-trained teachers to a compact student.
We propose to dissolve the KA into two aspects: sequence-level amalgamation (SA) and task-level amalgamation (TA).
In particular, a hint is generated within the sequence-level amalgamation by concatenating teacher sequences instead of redundantly aggregating them to a fixed-size one.
arXiv Detail & Related papers (2022-03-07T07:45:22Z)
- Short Range Correlation Transformer for Occluded Person Re-Identification [4.339510167603376]
We propose a partial feature transformer-based person re-identification framework named PFT.
The proposed PFT utilizes three modules to enhance the efficiency of the vision transformer.
Experimental results over occluded and holistic re-identification datasets demonstrate that the proposed PFT network achieves superior performance consistently.
arXiv Detail & Related papers (2022-01-04T11:12:39Z)
- Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel on the original shortcuts.
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
- Semantic Correspondence with Transformers [68.37049687360705]
We propose Cost Aggregation with Transformers (CATs) to find dense correspondences between semantically similar images.
We include appearance affinity modelling to disambiguate the initial correlation maps and multi-level aggregation.
We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies.
arXiv Detail & Related papers (2021-06-04T14:39:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.