Related papers: LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

URL: http://arxiv.org/abs/2308.16137v7
Date: Mon, 24 Jun 2024 21:22:00 GMT
Title: LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models
Authors: Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang,
Abstract summary: Large language models (LLMs) typically train on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures. This work identifies three major factors contributing to this length generalization failure. We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
Score: 83.98062659664785
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Today's large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis further reveals that commonly used techniques like truncating the attention window or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7x decoding speed up and 7.5x memory saving over the original model. Our codes are released at \url{https://github.com/Glaciohound/LM-Infinite}.

Related papers

Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing [19.577278316436807]
Large Language Models (LLMs) are limited by the context window size. We propose a novel method that leverages the LLMs's own attention information to enable accurate retrieval. InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack(NIH) test over 1M tokens using a 0.5B parameter model.
arXiv Detail & Related papers (2025-02-18T15:45:36Z)
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs) We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z)
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
SirLLM: Streaming Infinite Retentive LLM [74.40196814292426]
Large Language Models (LLMs) process inputs of any length and maintain a degree of memory. Recent efforts have employed streaming inputs to alleviate the pressure of excessively long text inputs. We introduce Streaming Infinite Retentive LLM (SirLLM), which allows LLMs to maintain longer memory during infinite-length dialogues.
arXiv Detail & Related papers (2024-05-21T06:37:03Z)
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models. HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding [15.723047976314751]
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding.
arXiv Detail & Related papers (2024-02-26T18:59:28Z)
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM. InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention. Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding [27.87483106859749]
Lookahead decoding is an exact, parallel decoding algorithm for large language models (LLMs) Our implementation can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks.
arXiv Detail & Related papers (2024-02-03T06:37:50Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
Efficient Long-Text Understanding with Short-Text Models [38.8375175429553]
SLED is a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. We partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
arXiv Detail & Related papers (2022-08-01T11:14:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.