Fast Transformers with Clustered Attention
- URL: http://arxiv.org/abs/2007.04825v2
- Date: Tue, 29 Sep 2020 20:18:43 GMT
- Title: Fast Transformers with Clustered Attention
- Authors: Apoorv Vyas, Angelos Katharopoulos, François Fleuret
- Abstract summary: We propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids.
This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters.
We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget.
- Score: 14.448898156256478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have been proven a successful model for a variety of tasks in
sequence modeling. However, computing the attention matrix, which is their key
component, has quadratic complexity with respect to the sequence length, thus
making them prohibitively expensive for large sequences. To address this, we
propose clustered attention, which instead of computing the attention for every
query, groups queries into clusters and computes attention just for the
centroids. To further improve this approximation, we use the computed clusters
to identify the keys with the highest attention per query and compute the exact
key/query dot products. This results in a model with linear complexity with
respect to the sequence length for a fixed number of clusters. We evaluate our
approach on two automatic speech recognition datasets and show that our model
consistently outperforms vanilla transformers for a given computational budget.
Finally, we demonstrate that our model can approximate arbitrarily complex
attention distributions with a minimal number of clusters by approximating a
pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no
loss in performance.
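To make the two-step approximation concrete, the following is a minimal single-head PyTorch sketch of clustered attention with the exact top-k refinement. It is an illustration under simplifying assumptions, not the authors' released implementation: plain Lloyd k-means stands in for the paper's faster query clustering, and `kmeans`, `n_clusters`, and `top_k` below are illustrative names and choices.

```python
import torch

def kmeans(x, n_clusters, n_iters=10):
    # Plain Lloyd k-means over the rows of x; returns (assignments, centroids).
    centroids = x[torch.randperm(x.shape[0])[:n_clusters]].clone()
    for _ in range(n_iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)       # (N,)
        for c in range(n_clusters):
            members = assign == c
            if members.any():
                centroids[c] = x[members].mean(dim=0)
    return assign, centroids

def clustered_attention(Q, K, V, n_clusters=25, top_k=32):
    N, D = Q.shape
    assign, centroids = kmeans(Q, n_clusters)                  # cluster the queries

    # Step 1: compute attention once per centroid and broadcast the result
    # to every query in that cluster: O(N * n_clusters) instead of O(N^2).
    A = torch.softmax(centroids @ K.T / D ** 0.5, dim=-1)      # (C, N)
    out = (A @ V)[assign]                                      # (N, Dv)

    # Step 2 (refinement): per cluster, take the top-k keys by centroid
    # attention and recompute those query/key dot products exactly.
    topk = A.topk(top_k, dim=-1).indices                       # (C, k)
    idx = topk[assign]                                         # (N, k)
    exact = torch.einsum('nd,nkd->nk', Q, K[idx]) / D ** 0.5   # (N, k)

    # Redistribute the probability mass the centroid placed on the top-k
    # keys according to the exact scores; all other keys keep their
    # centroid attention weights.
    approx = A.gather(1, topk)[assign]                         # (N, k)
    mass = approx.sum(dim=-1, keepdim=True)                    # (N, 1)
    p = torch.softmax(exact, dim=-1) * mass                    # (N, k)
    return out + torch.einsum('nk,nkd->nd', p - approx, V[idx])
```

With `Q`, `K`, `V` of shape `(N, D)`, the centroid attention costs $O(NC)$ and the refinement $O(Nk)$, which is linear in $N$ for fixed $C$ and $k$, matching the complexity claim in the abstract.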
Related papers
- CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers [3.129187821625805]
We propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST) to optimize the attention computation and achieve efficient transformers.
CAST improves efficiency by reducing the complexity from $O(N^2)$ to $O(\alpha N)$, where $N$ is the sequence length and $\alpha$ is a constant determined by the number of clusters and the number of samples per cluster.
arXiv Detail & Related papers (2024-02-06T18:47:52Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute the model's covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches the derived lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost; a minimal sketch of this key/value-clustering idea appears after this list.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We employ transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Conditional set generation using Seq2seq models [52.516563721766445]
Conditional set generation learns a mapping from an input sequence of tokens to a set.
Sequence-to-sequence (Seq2seq) models are a popular choice for modeling set generation.
We propose a novel algorithm for effectively sampling informative orders over the space of label orders.
arXiv Detail & Related papers (2022-05-25T04:17:50Z)
- Enhancing Latent Space Clustering in Multi-filter Seq2Seq Model: A Reinforcement Learning Approach [0.0]
We design a latent-enhanced multi-filter seq2seq model (LMS2S) that analyzes the latent space representations using a clustering algorithm.
Our experiments on semantic parsing and machine translation demonstrate the positive correlation between the clustering quality and the model's performance.
arXiv Detail & Related papers (2021-09-25T16:36:31Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer that performs attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
- Autoencoder-based time series clustering with energy applications [0.0]
Time series clustering is a challenging task due to the specific nature of the data.
In this paper we investigate the combination of a convolutional autoencoder and a k-medoids algorithm to perform time series clustering.
arXiv Detail & Related papers (2020-02-10T10:04:29Z)
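Several entries above (ClusTR and Cluster-Former in particular) apply the dual idea to the main paper: instead of clustering queries, they cluster keys and values so each query attends over a short sequence of cluster summaries. The sketch below is a generic, hedged illustration of that key/value-clustering pattern, not the exact method of any listed paper; the function name `kv_clustered_attention`, the plain k-means step, and mean-pooled value aggregation are all illustrative assumptions.

```python
import torch

def kv_clustered_attention(Q, K, V, n_clusters=64, n_iters=10):
    N, D = K.shape
    # Cluster the keys with plain k-means (an assumed, simple choice).
    centroids = K[torch.randperm(N)[:n_clusters]].clone()
    for _ in range(n_iters):
        assign = torch.cdist(K, centroids).argmin(dim=1)       # (N,)
        for c in range(n_clusters):
            members = assign == c
            if members.any():
                centroids[c] = K[members].mean(dim=0)

    # Aggregate the values per cluster (mean pooling as the aggregation).
    V_c = torch.zeros(n_clusters, V.shape[1]).index_add_(0, assign, V)
    counts = torch.zeros(n_clusters).index_add_(0, assign, torch.ones(N))
    V_c = V_c / counts.clamp(min=1).unsqueeze(1)               # (C, Dv)

    # Every query now attends over C cluster summaries instead of N keys.
    A = torch.softmax(Q @ centroids.T / D ** 0.5, dim=-1)      # (N, C)
    return A @ V_c
```

Attending over $C$ summaries instead of $N$ keys reduces the overall cost from $O(N^2)$ to $O(NC)$, the same linear-for-fixed-cluster-count behavior the papers above target.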
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.