Related papers: UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs

URL: http://arxiv.org/abs/2406.18173v1
Date: Wed, 26 Jun 2024 08:44:36 GMT
Title: UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs
Authors: Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji
Abstract summary: UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings. We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
Score: 111.12010207132204
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a context segment into memories and leverage these memories to predict outputs of the subsequent segment. Subsequently, by treating our memory-enhanced transformers as fully-connected recurrent neural networks (RNNs), we refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative incremental optimization techniques. These techniques not only diminish time complexity but also address the bias in gradient computation through an unbiased optimization process. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters, while keeping the inference cost nearly linear as context length increases.

Related papers

TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z)
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z)
Modality Agnostic Efficient Long Range Encoder [14.705955027331674]
We address the challenge of long-context processing on a single device using generic implementations. To overcome these limitations, we propose MAELRE, a unified and efficient transformer architecture. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models.
arXiv Detail & Related papers (2025-07-25T16:19:47Z)
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers [58.98923344096319]
REFORM is a novel inference framework that efficiently handles long contexts through a two-phase approach. It achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains.
arXiv Detail & Related papers (2025-06-01T23:49:14Z)
Training Long-Context LLMs Efficiently via Chunk-wise Optimization [60.05884946552877]
We present Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. We also introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer.
arXiv Detail & Related papers (2025-05-22T14:11:34Z)
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation [57.310236384112834]
In-context learning (ICL) is critical for large language models (LLMs) but its effectiveness is constrained by finite context windows. We introduce InfiniteICL, a framework that parallels context and parameters in LLMs with short- and long-term memory. We demonstrate that our method reduces context length by 90% while achieving 103% average performance of full-context prompting.
arXiv Detail & Related papers (2025-04-02T13:15:44Z)
Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage. A promising approach is to use Zeroth-Order (ZO) gradient estimates in place of First-Order (FO) gradients. We introduce a novel layer-wise sparse, computation- and memory-efficient ZO optimizer, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators. We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models. HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
Scaling Transformer to 1M tokens and beyond with RMT [5.60052250541419]
A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy.
arXiv Detail & Related papers (2023-04-19T16:18:54Z)
Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation [5.355990925686149]
We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We show that our sample-wise method significantly reduces memory usage and performs at competitive speed when compared to the default batched computation. As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024 and an audio length of 40 seconds, using only 6 GB of memory.
arXiv Detail & Related papers (2022-11-29T14:57:23Z)
A Low-Complexity Approach to Rate-Distortion Optimized Variable Bit-Rate Compression for Split DNN Computing [5.3221129103999125]
Split computing has emerged as a recent paradigm for implementation of DNN-based AI workloads. We present an approach that addresses the challenge of optimizing the rate-accuracy-complexity trade-off. Our approach is remarkably lightweight, both during training and inference, highly effective and achieves excellent rate-distortion performance.
arXiv Detail & Related papers (2022-08-24T15:02:11Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such a computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.