Centroid Transformers: Learning to Abstract with Attention
- URL: http://arxiv.org/abs/2102.08606v1
- Date: Wed, 17 Feb 2021 07:04:19 GMT
- Title: Centroid Transformers: Learning to Abstract with Attention
- Authors: Lemeng Wu, Xingchao Liu, Qiang Liu
- Abstract summary: Self-attention is a powerful mechanism for extracting features from the inputs.
We propose centroid attention, a generalization of self-attention that maps N inputs to M outputs $(M\leq N)$.
We apply our method to various applications, including abstractive text summarization, 3D vision, and image processing.
- Score: 15.506293166377182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention, as the key block of transformers, is a powerful mechanism for
extracting features from the inputs. In essence, what self-attention does is
infer the pairwise relations between the elements of the inputs and modify the
inputs by propagating information between input pairs. As a result, it maps N
inputs to N outputs and incurs quadratic $O(N^2)$ memory and time complexity.
We propose centroid attention, a generalization of self-attention that maps N
inputs to M outputs $(M\leq N)$, such that the key information in the inputs
is summarized in a smaller number of outputs (called centroids). We design
centroid attention by amortizing the gradient descent update rule of a
clustering objective function on the inputs, which reveals an underlying
connection between attention and clustering. By compressing the inputs to the
centroids, we extract the key information useful for prediction and also reduce
the computation of the attention module and the subsequent layers. We apply our
method to various applications, including abstractive text summarization, 3D
vision, and image processing. Empirical results demonstrate the effectiveness
of our method over the standard transformers.
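To make the mechanism concrete, below is a minimal sketch of a centroid attention layer in its cross-attention view: M centroid queries attend over the N inputs, so the attention cost is O(NM) rather than O(N^2). The initialization and the exact update rule (the paper amortizes gradient descent on a clustering objective) are simplified here, so treat the module and its names as illustrative rather than the authors' implementation.

```python
# Minimal sketch of a centroid attention layer (illustrative, not the authors' code).
# M centroid queries attend over the N inputs, so the attention cost is O(N*M)
# instead of O(N^2); the paper derives the exact update from amortized gradient
# descent on a clustering objective, which this sketch does not reproduce.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentroidAttention(nn.Module):
    def __init__(self, dim: int, num_centroids: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_centroids, dim))  # M initial centroids
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) -> returns (batch, M, dim)
        b, n, d = x.shape
        q = self.to_q(self.centroids).expand(b, -1, -1)         # (b, M, d) centroid queries
        k, v = self.to_k(x), self.to_v(x)                        # (b, N, d)
        attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, -1)   # (b, M, N) soft assignments
        return attn @ v                                          # centroids summarize the inputs

x = torch.randn(2, 128, 64)
print(CentroidAttention(64, 16)(x).shape)  # torch.Size([2, 16, 64])
```

Because the layer returns only M outputs, every subsequent attention layer also runs on the compressed sequence, which is where the computational savings compound.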
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
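As a rough illustration of the idea summarized above (processing attention scores like a feature map), the sketch below runs a small convolution over the per-head score map before the softmax; the kernel size, the residual-style combination, and the function names are assumptions, not details taken from the paper.

```python
# Hedged sketch: treat the (heads, N, N) attention logits as a multi-channel
# feature map and process them with a small convolution before the softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_processed_attention(q, k, v, conv: nn.Conv2d):
    # q, k, v: (batch, heads, N, head_dim)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (batch, heads, N, N)
    scores = scores + conv(scores)                         # convolution over the "score image"
    return F.softmax(scores, dim=-1) @ v

heads, n, d = 8, 64, 32
conv = nn.Conv2d(heads, heads, kernel_size=3, padding=1)
q = torch.randn(2, heads, n, d); k = torch.randn_like(q); v = torch.randn_like(q)
print(conv_processed_attention(q, k, v, conv).shape)  # torch.Size([2, 8, 64, 32])
```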
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
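The probing idea can be illustrated with a toy stand-in: replace the input-dependent softmax(QK^T) weights with a constant matrix. The uniform averaging below is only a sketch of the general recipe; PAPA itself uses pre-computed constant matrices.

```python
# Toy stand-in for probing with input-independent attention: the learned
# softmax(QK^T) weights are swapped for a constant matrix (uniform here).
import torch

def constant_attention(v: torch.Tensor) -> torch.Tensor:
    # v: (batch, N, dim); every position receives the same uniform mixture of values
    n = v.size(1)
    const_weights = torch.full((n, n), 1.0 / n)   # input-independent attention matrix
    return const_weights @ v

v = torch.randn(2, 10, 16)
print(constant_attention(v).shape)  # torch.Size([2, 10, 16])
```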
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reach state-of-the-art accuracies in the parameter-limited setting of the ImageNet classification benchmark.
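One possible reading of key-only attention is sketched below, under the assumption that a per-position saliency gate scores the keys and the resulting weights pool the values globally; the gate and module names are illustrative and may differ from the LinGlos design.

```python
# Hedged sketch of "key-only" attention: weights come from the keys alone via a
# simple linear saliency gate, so there is no O(N^2) query-key interaction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyOnlyAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.saliency = nn.Linear(dim, 1)   # per-position saliency gate (illustrative)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim)
        k, v = self.to_k(x), self.to_v(x)
        w = F.softmax(self.saliency(k), dim=1)        # (batch, N, 1), one weight per position
        context = (w * v).sum(dim=1, keepdim=True)    # global context in O(N), not O(N^2)
        return x + context                            # broadcast back to every position

x = torch.randn(2, 50, 32)
print(KeyOnlyAttention(32)(x).shape)  # torch.Size([2, 50, 32])
```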
arXiv Detail & Related papers (2022-07-01T03:36:49Z)
- Rotate to Attend: Convolutional Triplet Attention Module [21.228370317693244]
We present triplet attention, a novel method for computing attention weights using a three-branch structure.
Our method is simple as well as efficient and can be easily plugged into classic backbone networks as an add-on module.
We demonstrate the effectiveness of our method on various challenging tasks including image classification on ImageNet-1k and object detection on MSCOCO and PASCAL VOC datasets.
arXiv Detail & Related papers (2020-10-06T21:31:00Z)
- Quantifying Attention Flow in Transformers [12.197250533100283]
In the Transformer model, self-attention combines information from attended embeddings into the representation of the focal embedding in the next layer, so information originating from different tokens becomes increasingly mixed across layers.
This makes attention weights unreliable as explanation probes.
We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow.
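Attention rollout, the first of the two methods, is commonly computed by mixing each layer's head-averaged attention matrix with the identity (to account for residual connections) and multiplying the matrices across layers; the sketch below follows that convention, while attention flow (a max-flow computation) is not shown.

```python
# Sketch of attention rollout (the 0.5 residual mix and head-averaged matrices
# are the usual convention; attention flow is a separate max-flow computation).
import torch

def attention_rollout(attentions):
    # attentions: list of per-layer matrices, each (N, N), already averaged over heads
    n = attentions[0].size(0)
    rollout = torch.eye(n)
    for a in attentions:
        a = 0.5 * a + 0.5 * torch.eye(n)       # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)    # re-normalize rows
        rollout = a @ rollout                  # propagate attention through the layers
    return rollout  # (N, N): how much each output position draws from each input token

layers = [torch.softmax(torch.randn(6, 6), dim=-1) for _ in range(4)]
print(attention_rollout(layers).shape)  # torch.Size([6, 6])
```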
arXiv Detail & Related papers (2020-05-02T21:45:27Z)
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
arXiv Detail & Related papers (2020-04-23T14:58:22Z)
- FAIRS -- Soft Focus Generator and Attention for Robust Object Segmentation from Extreme Points [70.65563691392987]
We present a new approach to generate object segmentation from user inputs in the form of extreme points and corrective clicks.
We demonstrate our method's ability to generate high-quality training data as well as its scalability in incorporating extreme points, guiding clicks, and corrective clicks in a principled manner.
arXiv Detail & Related papers (2020-04-04T22:25:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.