Unlimiformer: Long-Range Transformers with Unlimited Length Input
- URL: http://arxiv.org/abs/2305.01625v3
- Date: Mon, 30 Oct 2023 19:44:47 GMT
- Title: Unlimiformer: Long-Range Transformers with Unlimited Length Input
- Authors: Amanda Bertsch, Uri Alon, Graham Neubig, Matthew R. Gormley
- Abstract summary: Unlimiformer is a general approach that wraps any existing pretrained encoder-decoder transformer.
It offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index.
We show that Unlimiformer can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time.
- Score: 67.04942180004805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the proposal of transformers, these models have been limited to bounded
input lengths, because of their need to attend to every token in the input. In
this work, we propose Unlimiformer: a general approach that wraps any existing
pretrained encoder-decoder transformer, and offloads the cross-attention
computation to a single k-nearest-neighbor (kNN) index, while the returned kNN
distances are the attention dot-product scores. This kNN index can be kept on
either the GPU or CPU memory and queried in sub-linear time; this way, we can
index practically unlimited input sequences, while every attention head in
every decoder layer retrieves its top-k keys, instead of attending to every
key. We evaluate Unlimiformer on several long-document and book-summarization
benchmarks, showing that it can process even 500k token-long inputs from the
BookSum dataset, without any input truncation at test time. We demonstrate that
Unlimiformer improves pretrained models such as BART and Longformer by
extending them to unlimited inputs without additional learned weights and
without modifying their code. We make our code and models publicly available at
https://github.com/abertsch72/unlimiformer .
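As a rough illustration of the retrieval step described in the abstract, the sketch below indexes encoder-side keys in a kNN index and lets a decoder query attend only to its retrieved top-k keys, so the returned inner-product scores play the role of attention logits. This is a single-head toy with illustrative names and sizes (`keys`, `values`, `top_k`, an exact flat faiss index), not the authors' implementation; the paper describes a single index shared across all heads and decoder layers with sub-linear query time.

```python
# Conceptual sketch of kNN-based cross-attention (not the official
# Unlimiformer code). Single attention head, toy sizes, exact flat index.
import numpy as np
import faiss  # pip install faiss-cpu

d = 64              # attention head dimension (illustrative)
n_tokens = 100_000  # stand-in for a book-length, untruncated input
top_k = 16          # keys retrieved per query instead of attending to every key

# Encoder-side keys and values for the full input sequence.
keys = np.random.randn(n_tokens, d).astype("float32")
values = np.random.randn(n_tokens, d).astype("float32")

# Index the keys once; inner-product search makes the returned kNN scores
# equal to the attention dot-product logits.
index = faiss.IndexFlatIP(d)
index.add(keys)

def knn_cross_attention(query: np.ndarray) -> np.ndarray:
    """Attend to only the top-k retrieved keys for one decoder query."""
    q = query.reshape(1, d).astype("float32")
    scores, ids = index.search(q, top_k)          # kNN scores = attention logits
    weights = np.exp(scores[0] / np.sqrt(d))
    weights /= weights.sum()                      # softmax over the retrieved keys
    return weights @ values[ids[0]]               # weighted sum of their values

out = knn_cross_attention(np.random.randn(d))
print(out.shape)  # (64,)
```

The flat index here is exact and linear in the input length, chosen only for clarity; the sub-linear query time mentioned in the abstract would require an approximate index kept on GPU or CPU memory.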
Related papers
- Equipping Transformer with Random-Access Reading for Long-Context Understanding [9.433800833564279]
Long-context modeling presents a significant challenge for transformer-based large language models.
We propose a novel reading strategy that enables transformers to efficiently process long documents without examining every token.
arXiv Detail & Related papers (2024-05-21T21:41:07Z) - Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition [7.963605445905696]
Conformer-based attention models have become the de facto backbone model for Automatic Speech Recognition tasks.
We propose a "Skip-and-Recover" Conformer architecture, named Skipformer, to shrink the input sequence length dynamically and inhomogeneously.
Our model reduces the input sequence length by a factor of 31 on Aishell-1 and 22 on the Librispeech corpus.
arXiv Detail & Related papers (2024-03-13T05:20:45Z) - Continuous-time Autoencoders for Regular and Irregular Time Series Imputation [21.25279298572273]
Time series imputation is one of the most fundamental tasks for time series.
Recent self-attention-based methods achieve state-of-the-art imputation performance.
Designing an imputation method based on continuous-time recurrent neural networks has long been overlooked.
arXiv Detail & Related papers (2023-12-27T14:13:42Z) - Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys.
We show that our approach yields accuracy nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference (a minimal sketch of this chunked top-$k$ attention appears after this list).
arXiv Detail & Related papers (2021-06-13T02:30:23Z) - Adaptive Nearest Neighbor Machine Translation [60.97183408140499]
kNN-MT combines pre-trained neural machine translation with token-level k-nearest-neighbor retrieval.
The traditional kNN algorithm retrieves the same number of nearest neighbors for every target token.
We propose Adaptive kNN-MT to dynamically determine the value of k for each target token.
arXiv Detail & Related papers (2021-05-27T09:27:42Z) - FSR: Accelerating the Inference Process of Transducer-Based Models by
Applying Fast-Skip Regularization [72.9385528828306]
A typical transducer model decodes the output sequence conditioned on the current acoustic state.
The number of blank tokens in the prediction results accounts for nearly 90% of all tokens.
We propose a method named fast-skip regularization, which tries to align the blank position predicted by a transducer with that predicted by a CTC model.
arXiv Detail & Related papers (2021-04-07T03:15:10Z) - Nystr\"omformer: A Nystr\"om-Based Algorithm for Approximating
Self-Attention [60.043273122786005]
We propose Nyströmformer, a model that exhibits favorable scalability as a function of sequence length.
The scalability of Nyströmformer enables application to longer sequences with thousands of tokens.
We perform evaluations on multiple downstream tasks on the GLUE benchmark and on reviews with standard sequence length, and find that Nyströmformer performs comparably to, and in a few cases even slightly better than, the standard Transformer.
arXiv Detail & Related papers (2021-02-07T20:06:59Z) - Learning to Encode Position for Transformer with Continuous Dynamical
Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded results along the position index with a continuous dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z) - Pruning Neural Belief Propagation Decoders [77.237958592189]
We introduce a method to tailor an overcomplete parity-check matrix to (neural) BP decoding using machine learning.
We achieve performance within 0.27 dB and 1.5 dB of maximum-likelihood (ML) performance while reducing the complexity of the decoder.
arXiv Detail & Related papers (2020-01-21T12:05:46Z)