Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding
- URL: http://arxiv.org/abs/2009.06097v2
- Date: Mon, 7 Jun 2021 06:08:27 GMT
- Title: Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding
- Authors: Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi
Sun, Yu Cheng, Jingjing Liu
- Abstract summary: Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
- Score: 90.77031668988661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has become ubiquitous in the deep learning field. One of the key
ingredients behind its success is the self-attention mechanism, which
allows fully-connected contextual encoding over input tokens. However, despite
its effectiveness in modeling short sequences, self-attention suffers when
handling inputs with extremely long-range dependencies, as its complexity grows
quadratically with respect to the sequence length. Therefore, long sequences
are often encoded by Transformer in chunks using a sliding window. In this
paper, we propose Cluster-Former, a novel clustering-based sparse Transformer
to perform attention across chunked sequences. The proposed framework is
pivoted on two unique types of Transformer layer: Sliding-Window Layer and
Cluster-Former Layer, which encode local sequence information and global
context jointly and iteratively. This new design allows information integration
beyond local windows, which is especially beneficial for question answering
(QA) tasks that rely on long-range dependencies. Experiments show that
Cluster-Former achieves state-of-the-art performance on several major QA
benchmarks.
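
To make the abstract's two-layer design concrete, below is a minimal NumPy sketch of the idea: a sliding-window layer that attends only within fixed-size chunks, followed by a clustering-based layer that groups hidden states by k-means and attends within each cluster. The function names, the plain (weight-free) softmax attention, the simple k-means routine, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two layer types described in the abstract:
# a Sliding-Window Layer (local attention within chunks) and a
# Cluster-Former-style layer (attention among tokens grouped by k-means
# over their hidden states). Illustrative only; no learned projections.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h):
    """Full softmax self-attention over an (n, d) block (weights omitted)."""
    scores = h @ h.T / np.sqrt(h.shape[-1])
    return softmax(scores) @ h

def sliding_window_layer(h, window=64):
    """Attend only within consecutive chunks of `window` tokens."""
    out = np.copy(h)
    for start in range(0, len(h), window):
        block = h[start:start + window]
        out[start:start + window] = self_attention(block)
    return out

def cluster_former_layer(h, n_clusters=8, iters=5):
    """Group tokens by k-means over hidden states; attend within each cluster."""
    rng = np.random.default_rng(0)
    centroids = h[rng.choice(len(h), n_clusters, replace=False)]
    for _ in range(iters):  # plain k-means on hidden states
        assign = np.argmin(
            ((h[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            members = h[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    out = np.copy(h)
    for c in range(n_clusters):
        idx = np.where(assign == c)[0]
        if len(idx):
            out[idx] = self_attention(h[idx])  # content-based global mixing
    return out

# Toy usage: 1,024 "token" states of width 32, local then clustered attention.
h = np.random.default_rng(1).normal(size=(1024, 32))
h = sliding_window_layer(h, window=64)      # local sequence information
h = cluster_former_layer(h, n_clusters=8)   # global context across chunks
print(h.shape)  # (1024, 32)
```

In this reading, the sliding-window pass captures local sequence information, while the clustering pass lets tokens with similar hidden states exchange information across distant chunks, matching the abstract's description of jointly and iteratively encoding local and global context.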
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- ConvTimeNet: A Deep Hierarchical Fully Convolutional Model for Multivariate Time Series Analysis [8.560776357590088]
ConvTimeNet is a novel deep hierarchical fully convolutional network designed to serve as a general-purpose model for time series analysis.
In experiments, ConvTimeNet consistently outperformed strong baselines in terms of effectiveness in most settings.
arXiv Detail & Related papers (2024-03-03T12:05:49Z)
- CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers [3.129187821625805]
We propose CAST, a novel Clustering self-Attention mechanism using Surrogate Tokens, to optimize attention and achieve efficient Transformers.
CAST improves efficiency by reducing the complexity from $O(N^2)$ to $O(\alpha N)$, where $N$ is the sequence length and $\alpha$ is a constant determined by the number of clusters and the number of samples per cluster (a rough numeric illustration of this scaling appears after this list).
arXiv Detail & Related papers (2024-02-06T18:47:52Z)
- Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention [17.48544285026157]
We introduce Fovea Transformer, a long-context focused transformer.
We use representations of context tokens with a progressively coarser granularity in the tree, as their distance to the query token increases.
We evaluate our model on three long-context summarization tasks.
arXiv Detail & Related papers (2023-11-13T06:24:27Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both Transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Efficient Long Sequence Encoding via Synchronization [29.075962393432857]
We propose a synchronization mechanism for hierarchical encoding.
Our approach first identifies anchor tokens across segments and groups them by their roles in the original input sequence.
Our approach is able to improve the global information exchange among segments while maintaining efficiency.
arXiv Detail & Related papers (2022-03-15T04:37:02Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse Transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
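
As referenced in the CAST entry above, the drop from $O(N^2)$ to $O(\alpha N)$ can be sanity-checked with a back-of-the-envelope count. The sequence length and cluster settings below are arbitrary example values, and treating $\alpha$ as the number of clusters plus the samples per cluster is an assumption, not the paper's exact definition.

```python
# Rough count of attention-score computations, comparing dense self-attention
# with the clustering-based O(alpha * N) scaling quoted in the CAST entry.
# N, `clusters`, and `per_cluster` are arbitrary illustrative values.
N = 4096                          # sequence length
clusters, per_cluster = 64, 64
alpha = clusters + per_cluster    # assumed reading of the constant alpha

dense_scores = N * N              # fully-connected self-attention: O(N^2)
sparse_scores = alpha * N         # clustering-based sparse attention: O(alpha*N)

print(f"dense : {dense_scores:>12,}")   # 16,777,216
print(f"sparse: {sparse_scores:>12,}")  #    524,288
print(f"saving: {dense_scores // sparse_scores}x fewer score computations")  # 32x
```

Under these assumed settings, the sparse scheme computes roughly 32x fewer attention scores than full self-attention; the exact factor depends entirely on how the clusters are configured.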
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all generated information) and is not responsible for any consequences arising from its use.