CAST: Clustering Self-Attention using Surrogate Tokens for Efficient
Transformers
- URL: http://arxiv.org/abs/2402.04239v1
- Date: Tue, 6 Feb 2024 18:47:52 GMT
- Title: CAST: Clustering Self-Attention using Surrogate Tokens for Efficient
Transformers
- Authors: Adjorn van Engelenhoven, Nicola Strisciuglio, Estefanía Talavera
- Abstract summary: We propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST) to optimize the attention computation and achieve efficient transformers.
CAST improves efficiency by reducing the complexity from $O(N^2)$ to $O(\alpha N)$, where $N$ is the sequence length and $\alpha$ is a constant determined by the number of clusters and the number of samples per cluster.
- Score: 3.129187821625805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer architecture has proven to be a powerful tool for a wide
range of tasks. It is based on the self-attention mechanism, an inherently
expensive operation with quadratic computational complexity: memory usage and
compute time increase quadratically with the length of the input sequence,
thus limiting the application of Transformers to long sequences. In this work,
we propose a novel Clustering self-Attention mechanism using Surrogate Tokens
(CAST), to optimize the attention computation and achieve efficient
transformers. CAST utilizes learnable surrogate tokens to construct a cluster
affinity matrix, used to cluster the input sequence and generate novel cluster
summaries. The self-attention from within each cluster is then combined with
the cluster summaries of other clusters, enabling information flow across the
entire input sequence. CAST improves efficiency by reducing the complexity from
$O(N^2)$ to $O(\alpha N)$, where $N$ is the sequence length and $\alpha$ is a
constant determined by the number of clusters and the number of samples per
cluster. We show that CAST performs better than or comparably to baseline
Transformers on long-range sequence modeling tasks, while also being more
time- and memory-efficient than other efficient Transformers.
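The abstract describes the mechanism only at a high level. As an illustration, here is a minimal NumPy sketch of one plausible reading of it: learnable surrogate tokens produce a cluster affinity matrix, tokens are hard-assigned to clusters, and each cluster's self-attention also sees summaries of the other clusters. The function names, shapes, summary rule, and assignment rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.special import softmax

def cast_attention_sketch(X, Wq, Wk, Wv, S):
    """Illustrative CAST-style attention (not the authors' code).

    X: (N, d) input tokens; S: (C, d) learnable surrogate tokens.
    Tokens are hard-assigned to their most similar surrogate token,
    attention runs inside each cluster, and per-cluster summaries let
    information flow between clusters."""
    N, d = X.shape
    C = S.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Cluster affinity between queries and surrogate tokens -> hard assignment.
    affinity = softmax(Q @ S.T / np.sqrt(d), axis=-1)                       # (N, C)
    assign = affinity.argmax(axis=-1)                                       # (N,)

    # Cluster summaries: affinity-weighted averages of the values.
    summaries = (affinity.T @ V) / (affinity.sum(axis=0)[:, None] + 1e-9)   # (C, d)

    out = np.zeros_like(V)
    for c in range(C):
        idx = np.where(assign == c)[0]
        if idx.size == 0:
            continue
        # Intra-cluster attention over member tokens plus all cluster summaries.
        keys = np.concatenate([K[idx], summaries], axis=0)
        vals = np.concatenate([V[idx], summaries], axis=0)
        attn = softmax(Q[idx] @ keys.T / np.sqrt(d), axis=-1)
        out[idx] = attn @ vals
    return out

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
N, d, C = 16, 8, 4
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
S = rng.normal(size=(C, d))
print(cast_attention_sketch(X, Wq, Wk, Wv, S).shape)   # (16, 8)
```

With roughly balanced clusters, each query attends to about N/C member tokens plus C summaries, which is where the $O(\alpha N)$ scaling claimed in the abstract comes from.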
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
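As a rough illustration of combining segmented (local) attention with a recurrent cross-segment summary, here is a toy NumPy sketch; the segment length, the mean-based state update, and all names are assumptions, not the SRformer architecture itself.

```python
import numpy as np
from scipy.special import softmax

def segmented_recurrent_attention(Q, K, V, seg_len):
    """Toy segmented attention with a recurrent cross-segment summary.
    Each segment attends to its own tokens plus a running summary of all
    previous segments (a crude stand-in for recurrent attention)."""
    N, d = Q.shape
    out = np.zeros_like(V)
    state = np.zeros((1, d))                       # recurrent summary vector
    for s in range(0, N, seg_len):
        q, k, v = Q[s:s+seg_len], K[s:s+seg_len], V[s:s+seg_len]
        keys = np.concatenate([k, state], axis=0)
        vals = np.concatenate([v, state], axis=0)
        attn = softmax(q @ keys.T / np.sqrt(d), axis=-1)
        out[s:s+seg_len] = attn @ vals
        # Simple interpolation update of the summary; a learned recurrent
        # update would be closer in spirit to the paper.
        state = 0.5 * state + 0.5 * v.mean(axis=0, keepdims=True)
    return out
```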
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
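The sketch below shows, in heavily simplified form, the kind of decayed linear-recurrence token mixing that allows RNN-style inference; the exact RWKV formulation differs (trained decays, bonus terms, gating, numerical stabilization), so treat this purely as an illustration, not the model's actual recurrence.

```python
import numpy as np

def decayed_recurrence_sketch(k, v, w, u):
    """Simplified RWKV-flavoured recurrence (illustrative only, numerically naive).
    k, v: (T, d) keys/values; w: (d,) positive per-channel decay rates;
    u: (d,) bonus applied to the current token. Each output is an exponentially
    decayed weighted average of past values, updatable one step at a time."""
    T, d = k.shape
    num = np.zeros(d)                    # decayed sum of exp(k_i) * v_i
    den = np.zeros(d)                    # decayed sum of exp(k_i)
    out = np.zeros((T, d))
    decay = np.exp(-w)                   # per-channel exponential decay
    for t in range(T):
        e_now = np.exp(u + k[t])
        out[t] = (num + e_now * v[t]) / (den + e_now + 1e-9)
        num = decay * num + np.exp(k[t]) * v[t]
        den = decay * den + np.exp(k[t])
    return out
```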
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
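To make the block recurrence concrete, here is a toy NumPy sketch in which each block of tokens attends to itself plus a small set of recurrent state vectors, after which the state is refreshed from the block; the projections, gating, and training scheme of the actual model are omitted and the update rule is an assumption.

```python
import numpy as np
from scipy.special import softmax

def block_recurrent_sketch(X, state, block_len):
    """Toy block-recurrent attention: tokens attend within their block and to
    recurrent state vectors; the state is then refreshed from the block."""
    N, d = X.shape
    out = np.zeros_like(X)
    for s in range(0, N, block_len):
        blk = X[s:s+block_len]
        ctx = np.concatenate([blk, state], axis=0)        # block tokens + state
        attn = softmax(blk @ ctx.T / np.sqrt(d), axis=-1)
        out[s:s+block_len] = attn @ ctx
        # State update: each state vector attends to the processed block
        # (a stand-in for the model's gated update).
        s_attn = softmax(state @ blk.T / np.sqrt(d), axis=-1)
        state = 0.5 * state + 0.5 * (s_attn @ blk)
    return out, state
```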
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
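The FFT trick itself is standard: an N x N Toeplitz matrix can be embedded in a 2N x 2N circulant matrix, and a circulant matrix-vector product is a pointwise multiplication in the Fourier domain. The sketch below shows only this O(N log N) Toeplitz multiply; how it is wired into kernelized attention with RPE follows the paper and is not reproduced here.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply an N x N Toeplitz matrix (first column c, first row r, with
    r[0] == c[0]) by a vector x in O(N log N) via circulant embedding + FFT."""
    N = len(x)
    # First column of a 2N x 2N circulant matrix that embeds the Toeplitz matrix.
    col = np.concatenate([c, [0.0], r[1:][::-1]])
    y = np.concatenate([x, np.zeros(N)])
    prod = np.fft.ifft(np.fft.fft(col) * np.fft.fft(y))
    return prod[:N].real

# Check against an explicit Toeplitz matrix.
from scipy.linalg import toeplitz
rng = np.random.default_rng(0)
c, r, x = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
r[0] = c[0]
assert np.allclose(toeplitz(c, r) @ x, toeplitz_matvec_fft(c, r, x))
```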
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- End-to-End Object Detection with Adaptive Clustering Transformer [37.9114488933667]
A novel variant of the Transformer, named Adaptive Clustering Transformer (ACT), has been proposed to reduce the computation cost for high-resolution input.
ACT clusters the query features adaptively using Locality Sensitive Hashing (LSH) and approximates the query-key interaction.
Code is released as supplementary for the ease of experiment replication and verification.
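As a rough illustration of clustering queries with LSH and approximating the query-key interaction through prototypes, here is a toy NumPy sketch; the random-hyperplane hashing, the mean-query prototype, and the broadcast step are assumptions for illustration rather than ACT's actual procedure.

```python
import numpy as np
from scipy.special import softmax

def lsh_prototype_attention(Q, K, V, n_bits=4, seed=0):
    """Toy ACT-style approximation: bucket queries with random-hyperplane LSH,
    compute attention once per bucket using the bucket's mean query as a
    prototype, and broadcast the result to every query in the bucket."""
    N, d = Q.shape
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(d, n_bits))
    bits = (Q @ planes > 0).astype(int)                      # (N, n_bits) sign bits
    codes = bits @ (1 << np.arange(n_bits))                  # integer hash codes
    out = np.zeros_like(V)
    for code in np.unique(codes):
        idx = np.where(codes == code)[0]
        proto = Q[idx].mean(axis=0, keepdims=True)           # (1, d) prototype query
        attn = softmax(proto @ K.T / np.sqrt(d), axis=-1)    # one attention row
        out[idx] = attn @ V                                   # shared by the bucket
    return out
```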
arXiv Detail & Related papers (2020-11-18T14:36:37Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
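Of the two layer types, the Sliding-Window Layer is the easier one to sketch: each token attends only to neighbours within a fixed window. The toy NumPy version below materializes the full mask for clarity, which a real implementation would avoid, and the Cluster-Former Layer that adds cross-chunk connections is not shown.

```python
import numpy as np
from scipy.special import softmax

def sliding_window_attention(Q, K, V, window):
    """Local attention in which each token attends only to neighbours within
    a fixed window; a stand-in for Cluster-Former's Sliding-Window Layer."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(N)
    mask = np.abs(idx[:, None] - idx[None, :]) > window   # positions outside the window
    scores[mask] = -np.inf                                 # blocked before the softmax
    return softmax(scores, axis=-1) @ V
```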
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
- Fast Transformers with Clustered Attention [14.448898156256478]
We propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids.
This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters.
We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget.
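A minimal sketch of the centroid idea, assuming a few plain k-means steps on the queries; the paper's actual clustering procedure and refinements are not reproduced here.

```python
import numpy as np
from scipy.special import softmax

def clustered_attention(Q, K, V, n_clusters=8, iters=5, seed=0):
    """Toy clustered attention: group queries with a few k-means steps,
    compute full attention only for the centroids, and reuse each centroid's
    output for every query in its cluster."""
    N, d = Q.shape
    rng = np.random.default_rng(seed)
    centroids = Q[rng.choice(N, n_clusters, replace=False)]
    for _ in range(iters):                                   # plain k-means on queries
        assign = ((Q[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(-1)
        for c in range(n_clusters):
            members = Q[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    attn = softmax(centroids @ K.T / np.sqrt(d), axis=-1)    # (C, N) attention
    centroid_out = attn @ V                                   # (C, d) centroid outputs
    return centroid_out[assign]                               # broadcast to queries
```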
arXiv Detail & Related papers (2020-07-09T14:17:50Z)