Toeplitz Neural Network for Sequence Modeling
- URL: http://arxiv.org/abs/2305.04749v1
- Date: Mon, 8 May 2023 14:49:01 GMT
- Title: Toeplitz Neural Network for Sequence Modeling
- Authors: Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li,
Yuchao Dai, Lingpeng Kong, Yiran Zhong
- Abstract summary: We show that a Toeplitz matrix-vector product trick reduces the space-time complexity of sequence modeling to log-linear.
A lightweight sub-network called the relative position encoder generates relative position coefficients with a fixed parameter budget.
Despite being trained on 512-token sequences, our model extrapolates to input sequences of up to 14K tokens at inference with consistent performance.
- Score: 46.04964190407727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence modeling has important applications in natural language processing
and computer vision. Recently, transformer-based models have shown strong
performance on various sequence modeling tasks, which rely on attention to
capture pairwise token relations, and position embedding to inject positional
information. Despite this strong performance, transformer models scale
inefficiently to long input sequences, mainly because of the quadratic
space-time complexity of attention. To overcome this inefficiency, we propose
to model sequences with a relative position encoded Toeplitz matrix and use a
Toeplitz matrix-vector product trick to reduce the space-time complexity of
sequence modeling to log-linear. A lightweight sub-network called the relative
position encoder is proposed to generate relative position coefficients with a
fixed budget of parameters, enabling the proposed Toeplitz neural network to
deal with varying sequence lengths. In addition, despite being trained on
512-token sequences, our model can extrapolate to input sequences of up to 14K
tokens at inference with consistent performance. Extensive experiments on
autoregressive and bidirectional language modeling, image modeling, and the
challenging Long-Range Arena benchmark show that our method achieves better
performance than its competitors in most downstream tasks while being
significantly faster. The code is available at
https://github.com/OpenNLPLab/Tnn.
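To make the log-linear claim concrete, here is a minimal NumPy sketch of the standard circulant-embedding trick for multiplying a Toeplitz matrix by a vector in O(n log n). It is not taken from the OpenNLPLab/Tnn code; the function and variable names are ours, and it handles a single channel of coefficients.

```python
import numpy as np

def toeplitz_matvec_fft(t_pos, t_neg, x):
    """O(n log n) product y = T x for the Toeplitz matrix T[i, j] = t_{i-j},
    where t_pos = [t_0, ..., t_{n-1}] holds the main and lower diagonals and
    t_neg = [t_{-1}, ..., t_{-(n-1)}] holds the upper diagonals."""
    n = x.shape[0]
    # First column of a 2n x 2n circulant matrix whose top-left n x n block is T.
    c = np.concatenate([t_pos, np.zeros(1), t_neg[::-1]])
    # A circulant matvec is a circular convolution, which the FFT diagonalizes.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, 2 * n))
    return y[:n].real

# Sanity check against the explicit O(n^2) Toeplitz matrix.
n = 8
rng = np.random.default_rng(0)
t_pos, t_neg, x = rng.normal(size=n), rng.normal(size=n - 1), rng.normal(size=n)
T = np.array([[t_pos[i - j] if i >= j else t_neg[j - i - 1] for j in range(n)]
              for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec_fft(t_pos, t_neg, x))
```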
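The relative position encoder can be sketched in the same spirit: a tiny MLP maps each relative offset i - j to a coefficient, so the parameter count is fixed regardless of sequence length, and the same weights can emit coefficients for 512 tokens at training time or 14K tokens at inference time. The layer sizes, the raw-offset input, and the decay constant below are illustrative assumptions rather than the paper's exact parameterization; the exponential factor mimics the decay bias the paper uses to keep length extrapolation stable.

```python
import numpy as np

def relative_position_coefficients(n, W1, b1, W2, b2, lam=0.99):
    """Toy relative position encoder: an MLP maps each offset i - j in
    [-(n-1), n-1] to one Toeplitz coefficient, with an exponential decay
    over |i - j| applied on top."""
    rel = np.arange(-(n - 1), n, dtype=np.float64)[:, None]  # (2n-1, 1) offsets
    h = np.maximum(rel @ W1 + b1, 0.0)                        # ReLU hidden layer
    t = (h @ W2 + b2).squeeze(-1)                             # one value per offset
    return (lam ** np.abs(rel[:, 0])) * t

# The same fixed parameter budget serves any sequence length.
d = 16
rng = np.random.default_rng(1)
W1, b1 = 0.1 * rng.normal(size=(1, d)), np.zeros(d)
W2, b2 = 0.1 * rng.normal(size=(d, 1)), np.zeros(1)
t_train = relative_position_coefficients(512, W1, b1, W2, b2)    # training length
t_long = relative_position_coefficients(14_336, W1, b1, W2, b2)  # extrapolation
```

The 2n-1 coefficients produced here, ordered from offset -(n-1) to n-1, are exactly the inputs consumed by the FFT matvec above: t_pos = t[n-1:] and t_neg = t[n-2::-1].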
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-Guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Efficient Time Series Processing for Transformers and State-Space Models through Token Merging [44.27818172708914]
Token merging has been shown to considerably improve the throughput of vision transformer architectures.
We introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood.
On the recently proposed Chronos foundation model, we achieve accelerations up to 5400% with only minor accuracy degradations.
arXiv Detail & Related papers (2024-05-28T08:28:18Z) - Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show that Aarens achieve performance comparable to Transformers on 38 datasets spread across four popular sequential problem settings.
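As an illustration of the many-to-one view described above (not code from that paper), softmax attention for a single query can be computed as an online-softmax recurrence over the keys and values; the paper's many-to-many result computes all prefixes of this fold in parallel with a prefix scan, which the sequential sketch below omits.

```python
import numpy as np

def attention_as_rnn(q, K, V):
    """Fold softmax attention for a single query q over (key, value) pairs,
    carrying a running max m, a rescaled numerator num, and denominator den."""
    m, num, den = -np.inf, np.zeros_like(V[0]), 0.0
    for k, v in zip(K, V):
        s = float(q @ k)
        m_new = max(m, s)
        scale = np.exp(m - m_new)           # rescale old state to the new max
        num = num * scale + np.exp(s - m_new) * v
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den                         # equals softmax(q K^T) V

# Check against the standard quadratic-memory formulation.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 4))
w = np.exp(q @ K.T - (q @ K.T).max())
assert np.allclose(attention_as_rnn(q, K, V), (w / w.sum()) @ V)
```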
arXiv Detail & Related papers (2024-05-22T19:45:01Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ that compresses the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps compensate for missing long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We make the SSM parameters input-dependent and integrate the resulting selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - Sequence Modeling with Multiresolution Convolutional Memory [27.218134279968062]
We introduce a new building block for sequence modeling called the MultiresLayer.
The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence.
Our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks.
arXiv Detail & Related papers (2023-05-02T17:50:54Z) - Continuous-time convolutions model of event sequences [46.3471121117337]
Event sequences are non-uniform and sparse, making traditional models unsuitable.
We propose COTIC, a method based on an efficient convolutional neural network designed to handle the non-uniform occurrence of events over time.
COTIC outperforms existing models in predicting the next event time and type, achieving an average rank of 1.5 compared to 3.714 for the nearest competitor.
arXiv Detail & Related papers (2023-02-13T10:34:51Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Learning to Encode Position for Transformer with Continuous Dynamical
Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded representation along the position index with a continuous dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)