Interactive Multi-Head Self-Attention with Linear Complexity
- URL: http://arxiv.org/abs/2402.17507v1
- Date: Tue, 27 Feb 2024 13:47:23 GMT
- Title: Interactive Multi-Head Self-Attention with Linear Complexity
- Authors: Hankyul Kang, Ming-Hsuan Yang, Jongbin Ryu
- Abstract summary: We show that the interactions between cross-heads of the attention matrix enhance the information flow of the attention operation.
We propose an effective method to decompose the attention operation into query- and key-less components.
- Score: 60.112941134420204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an efficient interactive method for multi-head self-attention via
decomposition. For existing methods using multi-head self-attention, the
attention operation of each head is computed independently. However, we show
that the interactions between cross-heads of the attention matrix enhance the
information flow of the attention operation. Considering that the attention
matrix of each head can be seen as a feature of networks, it is beneficial to
establish connectivity between them to capture interactions better. However, a
straightforward approach to capture the interactions between the cross-heads is
computationally prohibitive as the complexity grows substantially with the high
dimension of an attention matrix. In this work, we propose an effective method
to decompose the attention operation into query- and key-less components. This
will result in a more manageable size for the attention matrix, specifically
for the cross-head interactions. Expensive experimental results show that the
proposed cross-head interaction approach performs favorably against existing
efficient attention methods and state-of-the-art backbone models.
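Because each head's context summary in such a decomposition is small and independent of the sequence length, interactions across heads stay cheap. Below is a minimal PyTorch sketch of the general idea; the linear-attention factorization and the learnable head-mixing matrix are illustrative stand-ins, not the authors' exact query- and key-less decomposition.
```python
import torch
import torch.nn as nn


class CrossHeadLinearAttention(nn.Module):
    """Linear attention with a learned interaction across heads (sketch)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable head-mixing matrix: each head reads the small
        # (d x d) context summaries produced by every other head.
        self.head_mix = nn.Parameter(torch.eye(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.d).transpose(1, 2)
        k = k.view(B, N, self.h, self.d).transpose(1, 2)
        v = v.view(B, N, self.h, self.d).transpose(1, 2)
        # Kernel feature maps make the cost linear in N.
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        # Per-head context: (B, h, d, d), independent of sequence length.
        ctx = torch.einsum("bhnd,bhne->bhde", k, v)
        # Cross-head interaction: mix context matrices over the head axis.
        ctx = torch.einsum("gh,bhde->bgde", self.head_mix, ctx)
        out = torch.einsum("bhnd,bhde->bhne", q, ctx)
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))
```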
Related papers
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition [8.513434732050749]
We propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations.
Our network contains a tokenizer to partition Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motions of multiple diverse entities.
To jointly learn along three dimensions in ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations.
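As a rough illustration of such a block, the sketch below pairs a 3D convolution over the token grid with multi-head self-attention over the flattened tokens; the tensor layout, kernel size, and ordering are assumptions, not ISTA-Net's exact design.
```python
import torch
import torch.nn as nn


class ISTBlock(nn.Module):
    """3D convolution + self-attention over spatiotemporal tokens (sketch)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.conv3d = nn.Conv3d(dim, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, C, T, E, J) -- time steps, entities, joints.
        B, C, T, E, J = tokens.shape
        x = self.conv3d(tokens)               # local 3D correlations
        x = x.flatten(2).transpose(1, 2)      # (B, T*E*J, C) token sequence
        h = self.norm(x)
        x = x + self.attn(h, h, h)[0]         # inter-token correlations
        return x.transpose(1, 2).view(B, C, T, E, J)
```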
arXiv Detail & Related papers (2023-07-14T16:51:25Z)
- Boundary-aware Supervoxel-level Iteratively Refined Interactive 3D Image Segmentation with Multi-agent Reinforcement Learning [33.181732857907384]
We propose to model interactive image segmentation with a Markov decision process (MDP) and solve it with reinforcement learning (RL).
Considering the large exploration space for voxel-wise prediction, multi-agent reinforcement learning is adopted, where the voxel-level policy is shared among agents.
Experimental results on four benchmark datasets show that the proposed method significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-19T15:52:56Z)
- Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention [15.643176705932396]
We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on the correlation between the joint feature representation and that of each individual modality.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
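A minimal sketch of the idea, assuming the joint representation is a simple concatenation of the two modalities and the correlation is a learned bilinear product; the layer names and sizes are illustrative, not the paper's exact design.
```python
import torch
import torch.nn as nn


class JointCrossAttention(nn.Module):
    """Cross-attention weights from a joint audio-visual representation (sketch)."""

    def __init__(self, d_a: int, d_v: int):
        super().__init__()
        d_j = d_a + d_v
        self.proj_a = nn.Linear(d_j, d_a, bias=False)
        self.proj_v = nn.Linear(d_j, d_v, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor):
        # x_a: (B, T, d_a) audio features; x_v: (B, T, d_v) visual features.
        joint = torch.cat([x_a, x_v], dim=-1)   # joint representation (B, T, d_a+d_v)
        # Correlation between the joint representation and each modality.
        w_a = torch.softmax(torch.einsum("btd,bsd->bts", self.proj_a(joint), x_a)
                            / x_a.shape[-1] ** 0.5, dim=-1)
        w_v = torch.softmax(torch.einsum("btd,bsd->bts", self.proj_v(joint), x_v)
                            / x_v.shape[-1] ** 0.5, dim=-1)
        return w_a @ x_a, w_v @ x_v             # attended audio / visual features
```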
arXiv Detail & Related papers (2022-09-19T15:01:55Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
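As a rough illustration, such an objective can be added to any transformer as an auxiliary penalty on the query and key projections of each head; the moment-matching loss below is a simple stand-in for the paper's distribution-matching objective, not its actual formulation.
```python
import torch


def alignment_loss(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (batch, heads, tokens, dim) projections from one attention layer."""
    # Encourage the per-head query and key distributions to match by
    # aligning their first and second moments over the token axis.
    mean_gap = (q.mean(dim=2) - k.mean(dim=2)).pow(2).sum(dim=-1)
    var_gap = (q.var(dim=2) - k.var(dim=2)).pow(2).sum(dim=-1)
    return (mean_gap + var_gap).mean()

# Usage: total_loss = task_loss + lam * alignment_loss(q, k) per layer,
# where lam is a small weighting hyperparameter.
```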
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
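The mechanism replaces token-to-token attention with attention against two small shared memories implemented as linear layers, so the cost is linear in the number of tokens. A minimal sketch with an arbitrary memory size, including the double normalization (softmax over tokens, then l1 over memory units):
```python
import torch
import torch.nn as nn


class ExternalAttention(nn.Module):
    """Attention against two small, shared, learnable memories (sketch)."""

    def __init__(self, dim: int, mem_units: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_units, bias=False)  # external key memory
        self.mv = nn.Linear(mem_units, dim, bias=False)  # external value memory

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim); cost is O(N * mem_units), linear in N.
        attn = torch.softmax(self.mk(x), dim=1)                # normalize over tokens
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # l1 over memory units
        return self.mv(attn)
```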
arXiv Detail & Related papers (2021-05-05T22:29:52Z)
- Context-Aware Interaction Network for Question Matching [51.76812857301819]
We propose a context-aware interaction network (COIN) to align two sequences and infer their semantic relationship.
Specifically, each interaction block includes (1) a context-aware cross-attention mechanism to effectively integrate contextual information, and (2) a gate fusion layer to flexibly interpolate aligned representations.
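A minimal sketch of the gate fusion step, interpolating a token's contextual representation with its cross-attended (aligned) counterpart; the gating parameterization is an illustrative assumption rather than COIN's exact layer.
```python
import torch
import torch.nn as nn


class GateFusion(nn.Module):
    """Gated interpolation of contextual and aligned representations (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, aligned: torch.Tensor) -> torch.Tensor:
        # h, aligned: (B, T, dim); g in (0, 1) decides the per-feature blend.
        g = torch.sigmoid(self.gate(torch.cat([h, aligned], dim=-1)))
        return g * h + (1.0 - g) * aligned
```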
arXiv Detail & Related papers (2021-04-17T05:03:56Z)
- Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects attention differences among views and adaptively integrates frame-level information so that the views benefit from each other.
Experiments on four action datasets show that the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.