Associative Transformer
- URL: http://arxiv.org/abs/2309.12862v3
- Date: Wed, 31 Jan 2024 01:05:14 GMT
- Title: Associative Transformer
- Authors: Yuwei Sun, Hideya Ochiai, Zhirong Wu, Stephen Lin, Ryota Kanai
- Abstract summary: We propose Associative Transformer (AiT) to enhance the association among sparsely attended input patches.
AiT requires significantly fewer parameters and attention layers while outperforming Vision Transformers and a broad range of sparse Transformers.
- Score: 26.967506484952214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emerging from the pairwise attention in conventional Transformers, there is a
growing interest in sparse attention mechanisms that align more closely with
localized, contextual learning in the biological brain. Existing studies such
as the Coordination method employ iterative cross-attention mechanisms with a
bottleneck to enable the sparse association of inputs. However, these methods
are parameter inefficient and fail in more complex relational reasoning tasks.
To this end, we propose Associative Transformer (AiT) to enhance the
association among sparsely attended input patches, improving parameter
efficiency and performance in relational reasoning tasks. AiT leverages a
learnable explicit memory, comprised of various specialized priors, with a
bottleneck attention to facilitate the extraction of diverse localized
features. Moreover, we propose a novel associative memory-enabled patch
reconstruction with a Hopfield energy function. Extensive experiments on
four image classification tasks with three different sizes of AiT demonstrate
that AiT requires significantly fewer parameters and attention layers while
outperforming Vision Transformers and a broad range of sparse Transformers.
Additionally, AiT establishes new SOTA performance on the Sort-of-CLEVR
dataset, outperforming the previous Coordination method.
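To make the abstract's two mechanisms concrete, here is a minimal PyTorch sketch of a bottleneck attention over a bank of learnable priors followed by a Hopfield-style softmax read-out for patch reconstruction. The module name, prior count, head count, and inverse temperature are illustrative assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAssociativeBlock(nn.Module):
    """Sketch: learnable priors (explicit memory) summarize input patches through
    a bottleneck cross-attention, then patches are reconstructed by a softmax
    (modern-Hopfield-style) retrieval against the updated memory."""

    def __init__(self, dim=192, num_priors=32, beta=8.0):
        super().__init__()
        self.priors = nn.Parameter(torch.randn(num_priors, dim) * 0.02)  # explicit memory slots
        self.write = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.beta = beta  # inverse temperature of the Hopfield-style read-out

    def forward(self, patches):                         # patches: (B, N, dim)
        B = patches.size(0)
        mem = self.priors.unsqueeze(0).expand(B, -1, -1)
        # Bottleneck write: the few priors (queries) attend to all patches.
        mem, _ = self.write(mem, patches, patches)
        # Associative read: each patch is reconstructed as a softmax-weighted
        # mixture of memory slots, i.e. one retrieval step of a modern Hopfield net.
        attn = F.softmax(self.beta * patches @ mem.transpose(1, 2), dim=-1)
        return attn @ mem                                # reconstructed patches

x = torch.randn(2, 196, 192)                             # 14x14 patches of a toy image
print(BottleneckAssociativeBlock()(x).shape)             # torch.Size([2, 196, 192])
```

The write step compresses the N patches into a small number of memory slots (the bottleneck), and the read step is the associative reconstruction that the abstract attributes to the Hopfield energy function.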
Related papers
- Mixture of Attention Yields Accurate Results for Tabular Data [21.410818837489973]
We propose MAYA, an encoder-decoder transformer-based framework.
In the encoder, we design a Mixture of Attention (MOA) that constructs multiple parallel attention branches (a sketch of the idea follows this entry).
We employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations.
arXiv Detail & Related papers (2025-02-18T03:43:42Z)
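For the Mixture of Attention described in the entry above, a generic sketch of parallel self-attention branches combined by a learned per-token gate might look as follows; the module name, branch count, and gating scheme are assumptions, not MAYA's actual design.

```python
import torch
import torch.nn as nn

class MixtureOfAttention(nn.Module):
    """Illustrative sketch: parallel self-attention branches whose outputs are
    mixed with a learned, per-token softmax gate."""

    def __init__(self, dim=64, num_branches=3, num_heads=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_branches)
        )
        self.gate = nn.Linear(dim, num_branches)         # routes each token across branches

    def forward(self, x):                                # x: (B, N, dim)
        outs = torch.stack([attn(x, x, x)[0] for attn in self.branches], dim=-1)
        weights = torch.softmax(self.gate(x), dim=-1).unsqueeze(2)   # (B, N, 1, K)
        return (outs * weights).sum(dim=-1)              # gated mixture of branch outputs

x = torch.randn(8, 16, 64)                               # a batch of tabular-feature tokens
print(MixtureOfAttention()(x).shape)                     # torch.Size([8, 16, 64])
```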
- Toward Relative Positional Encoding in Spiking Transformers [52.62008099390541]
Spiking neural networks (SNNs) are bio-inspired networks that model how neurons in the brain communicate through discrete spikes.
In this paper, we introduce an approximate method for relative positional encoding (RPE) in Spiking Transformers.
arXiv Detail & Related papers (2025-01-28T06:42:37Z)
- Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning [59.001091197106085]
Multi-Task Learning (MTL) for Vision Transformer aims at enhancing the model capability by tackling multiple tasks simultaneously.
Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning (a minimal LoRA sketch follows this entry).
We propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner.
arXiv Detail & Related papers (2025-01-12T17:41:23Z)
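As background for the Low-Rank Adaptation mentioned in the entry above, the snippet below shows a minimal LoRA adapter around a frozen linear layer; it is a generic illustration of the technique, not EMTAL's specific multi-task construction, and the rank and scaling are assumed values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: a frozen pre-trained linear layer plus a trainable
    low-rank update, y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(192, 192))
print(layer(torch.randn(2, 196, 192)).shape)             # torch.Size([2, 196, 192])
```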
- Scaled and Inter-token Relation Enhanced Transformer for Sample-restricted Residential NILM [0.0]
We propose a novel transformer architecture with two key innovations: inter-token relation enhancement and dynamic temperature tuning (a temperature-scaled attention sketch follows this entry).
We validate our method on the REDD dataset and show that it outperforms the original transformer and state-of-the-art models by 10-15% in F1 score across various appliance types.
arXiv Detail & Related papers (2024-10-12T18:58:45Z)
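The "dynamic temperature tuning" named above plausibly amounts to adapting the softmax temperature of the attention; the snippet below is only a generic single-head attention with a learnable temperature, with all names and sizes assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaledAttention(nn.Module):
    """Illustrative single-head self-attention whose softmax temperature is a
    learned parameter (a stand-in for 'dynamic temperature tuning')."""

    def __init__(self, dim=64):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.log_temp = nn.Parameter(torch.zeros(1))     # temperature = exp(log_temp), learned

    def forward(self, x):                                # x: (B, T, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scale = q.size(-1) ** -0.5 / torch.exp(self.log_temp)
        attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return attn @ v

x = torch.randn(4, 128, 64)                              # windows of a power-consumption series
print(TemperatureScaledAttention()(x).shape)             # torch.Size([4, 128, 64])
```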
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision (sketched after this entry).
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
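The idea of treating attention scores as a feature map and convolving them, as the DAPE V2 entry describes, can be sketched as below; the kernel size, residual form, and head count are assumptions made for illustration, not the paper's exact operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvProcessedAttention(nn.Module):
    """Sketch: the (heads x N x N) attention-score tensor is treated as an image
    with one channel per head and refined by a 2D convolution before softmax."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.conv = nn.Conv2d(heads, heads, kernel_size=3, padding=1)  # heads act as channels

    def forward(self, x):                                 # x: (B, N, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).view(B, N, 3, self.heads, D // self.heads).unbind(2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # each (B, H, N, d)
        scores = q @ k.transpose(-2, -1) * (q.size(-1) ** -0.5)
        scores = scores + self.conv(scores)               # convolve the score "image" (residual form)
        out = F.softmax(scores, dim=-1) @ v               # (B, H, N, d)
        return out.transpose(1, 2).reshape(B, N, D)

x = torch.randn(2, 50, 64)
print(ConvProcessedAttention()(x).shape)                  # torch.Size([2, 50, 64])
```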
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer (see the sketch after this entry).
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
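The skip-layer interaction above, where queries attend to keys and values from both the current and the preceding layer, can be approximated by concatenating the two layers' hidden states as the key/value set; this is a rough sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SkipLayerAttention(nn.Module):
    """Sketch: queries from layer l attend to keys/values drawn from the hidden
    states of layers l and l-1, concatenated along the sequence axis."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_curr, x_prev):                   # both: (B, N, dim)
        kv = torch.cat([x_curr, x_prev], dim=1)          # keys/values from two layers
        out, _ = self.attn(x_curr, kv, kv)
        return out

x_prev = torch.randn(2, 32, 64)                          # hidden states of layer l-1
x_curr = torch.randn(2, 32, 64)                          # hidden states of layer l
print(SkipLayerAttention()(x_curr, x_prev).shape)        # torch.Size([2, 32, 64])
```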
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level (see the sketch after this entry).
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
arXiv Detail & Related papers (2023-11-20T17:35:44Z)
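A rough sketch of the lagged, feature-wise attention described above follows: for each lag, a D x D cross-covariance between queries and time-shifted keys re-weights the value channels, and the per-lag outputs are averaged. The lag set, normalization, and averaging are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelatedAttention(nn.Module):
    """Sketch: attention across feature channels using lagged cross-covariance
    matrices between queries and keys, aggregated over a small set of lags."""

    def __init__(self, dim=8, lags=(0, 1, 2)):
        super().__init__()
        self.lags = lags
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):                                # x: (B, T, D) multivariate series
        q, k, v = self.q(x), self.k(x), self.v(x)
        T = x.size(1)
        outs = []
        for lag in self.lags:
            k_lag = torch.roll(k, shifts=lag, dims=1)    # time-shifted keys
            cov = q.transpose(1, 2) @ k_lag / T          # (B, D, D) feature-wise cross-covariance
            outs.append(v @ F.softmax(cov, dim=-1))      # re-weight value channels
        return torch.stack(outs).mean(dim=0)             # aggregate over lags

x = torch.randn(4, 96, 8)                                # 8-variate series, 96 time steps
print(CorrelatedAttention()(x).shape)                    # torch.Size([4, 96, 8])
```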
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers (illustrated after this entry).
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
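The FLOP-saving argument can be illustrated directly: when many post-ReLU activations in an MLP block are zero, the second matrix product only needs the rows of W2 that belong to active neurons. The toy example below uses random weights, so the active fraction is larger than the sparsity reported for trained Transformers, but the equivalence and the reduced multiplication count are visible.

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 64, 256
W1, W2 = torch.randn(d_model, d_ff), torch.randn(d_ff, d_model)
x = torch.randn(d_model)

h = torch.relu(x @ W1)                         # post-ReLU activations of the MLP block
nz = h.nonzero(as_tuple=True)[0]               # indices of the active (nonzero) neurons
dense_out = h @ W2                             # full matmul: d_ff * d_model multiplications
sparse_out = h[nz] @ W2[nz]                    # active rows only: |nz| * d_model multiplications

print(f"{len(nz)}/{d_ff} neurons active")
print(torch.allclose(dense_out, sparse_out, atol=1e-4))   # same output, fewer multiplications
```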
- Knowledge Amalgamation for Object Detection with Transformers [36.7897364648987]
Knowledge amalgamation (KA) is a novel deep-model reuse task that aims to transfer knowledge from several well-trained teachers to a compact student.
We propose to dissolve the KA into two aspects: sequence-level amalgamation (SA) and task-level amalgamation (TA).
In particular, a hint is generated within the sequence-level amalgamation by concatenating teacher sequences instead of redundantly aggregating them to a fixed-size one (sketched below).
arXiv Detail & Related papers (2022-03-07T07:45:22Z)
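The sequence-level hint can be pictured as a plain concatenation of teacher token sequences, with the student regressed toward the concatenated hint; the shapes and projection layer below are assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: two detection teachers each emit N decoder tokens of width d_t;
# the student emits 2N tokens of width d_s and is matched to the concatenated hint.
B, N, d_t, d_s = 2, 100, 256, 192
teacher_a = torch.randn(B, N, d_t)                        # token sequence of teacher A
teacher_b = torch.randn(B, N, d_t)                        # token sequence of teacher B
student = torch.randn(B, 2 * N, d_s, requires_grad=True)

hint = torch.cat([teacher_a, teacher_b], dim=1)           # (B, 2N, d_t): concatenated, not pooled
proj = nn.Linear(d_s, d_t)                                # align student width to the hint
loss = nn.functional.mse_loss(proj(student), hint)        # sequence-level amalgamation loss
loss.backward()
print(loss.item())
```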
- Semantic Correspondence with Transformers [68.37049687360705]
We propose Cost Aggregation with Transformers (CATs) to find dense correspondences between semantically similar images.
We include appearance affinity modelling to disambiguate the initial correlation maps and multi-level aggregation.
We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies.
arXiv Detail & Related papers (2021-06-04T14:39:03Z)