Related papers: InfiniPot: Infinite Context Processing on Memory-Constrained LLMs

InfiniPot: Infinite Context Processing on Memory-Constrained LLMs

URL: http://arxiv.org/abs/2410.01518v1
Date: Wed, 2 Oct 2024 13:09:41 GMT
Title: InfiniPot: Infinite Context Processing on Memory-Constrained LLMs
Authors: Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang,
Abstract summary: InfiniPot is a novel KV cache control framework designed to enable pre-trained Large Language Models to manage extensive sequences efficiently. InfiniPot effectively maintains critical data even without access to future context. This work represents a substantial advancement toward making Large Language Models applicable to a broader range of real-world scenarios.
Score: 17.111422610001227
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Handling long input contexts remains a significant challenge for Large Language Models (LLMs), particularly in resource-constrained environments such as mobile devices. Our work aims to address this limitation by introducing InfiniPot, a novel KV cache control framework designed to enable pre-trained LLMs to manage extensive sequences within fixed memory constraints efficiently, without requiring additional training. InfiniPot leverages Continual Context Distillation (CCD), an iterative process that compresses and retains essential information through novel importance metrics, effectively maintaining critical data even without access to future context. Our comprehensive evaluations indicate that InfiniPot significantly outperforms models trained for long contexts in various NLP tasks, establishing its efficacy and versatility. This work represents a substantial advancement toward making LLMs applicable to a broader range of real-world scenarios.

Related papers

InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation [57.310236384112834]
In-context learning (ICL) is critical for large language models (LLMs) but its effectiveness is constrained by finite context windows. We introduce InfiniteICL, a framework that parallels context and parameters in LLMs with short- and long-term memory. We demonstrate that our method reduces context length by 90% while achieving 103% average performance of full-context prompting.
arXiv Detail & Related papers (2025-04-02T13:15:44Z)
Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing [19.577278316436807]
Large Language Models (LLMs) are limited by the context window size. We propose a novel method that leverages the LLMs's own attention information to enable accurate retrieval. InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack(NIH) test over 1M tokens using a 0.5B parameter model.
arXiv Detail & Related papers (2025-02-18T15:45:36Z)
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs) We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z)
ReAttention: Training-Free Infinite Context with Finite Attention Scope [65.91272939057592]
Long-context capability of Large Language Models (LLM) has made significant breakthroughs, but the maximum supported context length remains a critical bottleneck limiting their practical applications. We propose bftextReAttention, a training-free approach enabling LLM based on the self-attention mechanism to support an infinite context with a finite attention scope under sufficient memory resources. We validate the performance of ReAttention on the LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods.
arXiv Detail & Related papers (2024-07-21T14:23:37Z)
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs) This work provides a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [0.5899781520375794]
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. serving inference for generating long contents poses a challenge due to the enormous memory footprint of the transient state. InfiniGen is a novel KV cache management framework tailored for long-text generation.
arXiv Detail & Related papers (2024-06-28T07:41:26Z)
On the Compressibility of Quantized Large Language Models [13.443384050034922]
Large Language Models (LLMs) are deployed on edge or mobile devices to offer enhanced data privacy and real-time processing capabilities. LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference. We study applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices.
arXiv Detail & Related papers (2024-03-03T03:27:07Z)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM. InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention. Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z)
RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
Online Continual Learning Without the Storage Constraint [67.66235695269839]
We contribute a simple algorithm, which updates a kNN classifier continually along with a fixed, pretrained feature extractor. It can adapt to rapidly changing streams, has zero stability gap, operates within tiny computational budgets, has low storage requirements by only storing features. It can outperform existing methods by over 20% in accuracy on two large-scale online continual learning datasets.
arXiv Detail & Related papers (2023-05-16T08:03:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.