Related papers: Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

URL: http://arxiv.org/abs/2407.15892v4
Date: Sat, 09 Nov 2024 15:09:19 GMT
Title: Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
Authors: Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, Anima Anandkumar,
Abstract summary: Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. integrated with the huggingface library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
Score: 78.93900796545523
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both forward and backward passes. In experiments with the Llama3-8B model, with MsT, we measure no degradation in throughput or convergence even with 12x longer sequences than standard implementations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks. Integrated with the huggingface library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.

Related papers

CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling [40.705016911274]
We introduce a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity.<n>CoMeT can be integrated into pre-trained models with only minimal fine-tuning.<n>A model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence.
arXiv Detail & Related papers (2026-02-02T07:49:44Z)
ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer [8.224366948749838]
ParaDySe is a novel adaptive Parallel strategy switching framework for Dynamic Sequences.<n>ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence.
arXiv Detail & Related papers (2025-11-17T10:08:24Z)
TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers.<n>We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process.<n>TNT achieves a substantial acceleration in training speed-up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z)
ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time [5.554829574749047]
ourmodelacronym(Extend at Test-Time) is a method for extending the context length of short context Transformer-based Language Models.<n>We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens.
arXiv Detail & Related papers (2025-07-08T18:06:45Z)
Length Controlled Generation for Black-box LLMs [70.57649832433451]
Large language models (LLMs) have demonstrated impressive instruction following capabilities, but struggle to accurately manage the length of generated text. We propose a novel iterative sampling framework for text length control, integrating the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. Our framework achieves almost 100% success rates of length control on Llama3.1 for tasks such as length-controlled abstractive summarization.
arXiv Detail & Related papers (2024-12-19T09:07:38Z)
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase.<n>Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.<n>In this paper, we propose Subspace Zero-order optimization to address the challenges posed by posed by high dimensionality perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. We propose MEMO, a novel framework for fine-grained activation memory management. We show that MEMO achieves an average of 2.42x and 2.26x MFU compared to Megatron-LM and DeepSpeed.
arXiv Detail & Related papers (2024-07-16T18:59:49Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models. HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models [17.027512781038617]
Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate. MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning. MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD.
arXiv Detail & Related papers (2024-04-11T18:35:49Z)
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM. InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention. Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
Ultra-Long Sequence Distributed Transformer [10.263668150008316]
Transformer models trained on long sequences often achieve higher accuracy than short sequences. Existing methods for long sequence training offer limited speedup and memory reduction. This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
arXiv Detail & Related papers (2023-11-04T11:38:53Z)
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention. Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.