MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- URL: http://arxiv.org/abs/2407.12117v3
- Date: Wed, 15 Jan 2025 08:03:55 GMT
- Title: MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- Authors: Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui
- Abstract summary: Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications.
We propose MEMO, a novel framework for fine-grained activation memory management.
MEMO achieves an average of 1.97x and 1.80x the MFU of Megatron-LM and DeepSpeed, respectively.
- Score: 24.066283519769968
- License:
- Abstract: Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelisms. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing memory reuse across transformer layers. Empirical results demonstrate that MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays associated with the memory reorganization process due to fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates efficient training of 7B LLM with 1 million sequence length on just 8 A800 GPUs, achieving an MFU of 52.30%.
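The layer-wise offload-and-prefetch idea described in the abstract can be illustrated with a short PyTorch sketch. This is not MEMO's implementation; the helper name `cpu_offload_hooks` and the size threshold are assumptions for illustration. The sketch uses the public `torch.autograd.graph.saved_tensors_hooks` API to copy large saved activations to pinned CPU memory during the forward pass and fetch them back for the backward pass.

```python
# Minimal, hypothetical sketch of layer-wise activation offloading (not MEMO's code).
# Large activations saved for backward are copied to pinned CPU memory during the
# forward pass and fetched back to the GPU when the backward pass needs them.
import torch
from torch.autograd.graph import saved_tensors_hooks


def cpu_offload_hooks(device: str = "cuda", min_numel: int = 1 << 20):
    """Offload saved tensors larger than `min_numel` elements to pinned CPU memory."""

    def pack(t: torch.Tensor):
        if not t.is_cuda or t.numel() < min_numel:
            return ("keep", t)  # small or CPU-resident tensors stay where they are
        buf = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
        buf.copy_(t)  # a real system would overlap this copy with compute on a side stream
        return ("offloaded", buf)

    def unpack(packed):
        kind, t = packed
        # Fetch offloaded activations back to the GPU right before backward uses them.
        return t.to(device, non_blocking=True) if kind == "offloaded" else t

    return saved_tensors_hooks(pack, unpack)


# Usage: wrap each transformer layer's forward pass so that its saved activations
# live in CPU memory until the backward pass touches them.
# with cpu_offload_hooks():
#     hidden_states = layer(hidden_states)
```

MEMO's actual mechanism is finer-grained: it swaps and recomputes activations token-wise to overlap transfers with computation and plans memory reuse across layers with a bi-level MIP, none of which is captured by this minimal sketch.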
Related papers
- MoM: Linear Sequence Modeling with Mixture-of-Memories [9.665802842933209]
We introduce a novel architecture called Mixture-of-Memories (MoM).
MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states.
MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques.
arXiv Detail & Related papers (2025-02-19T12:53:55Z)
- Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training [45.225732322141994]
Large language models (LLMs) have impressive performance across a range of natural language processing tasks.
Their vast number of parameters introduces significant memory challenges during training.
Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing.
We propose a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements.
arXiv Detail & Related papers (2025-01-13T11:35:09Z)
- CompAct: Compressed Activations for Memory-Efficient LLM Training [7.837209773889032]
CompAct is a technique that reduces peak memory utilization on GPU by 25-30% for pretraining and 50% for fine-tuning of LLMs.
By storing low-rank, compressed activations to be used in the backward pass, we greatly reduce the required memory.
We expect CompAct's savings to scale even higher for larger models.
arXiv Detail & Related papers (2024-10-20T10:24:38Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- $\text{Memory}^3$: Language Modeling with Explicit Memory [22.572376536612015]
We equip large language models (LLMs) with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG).
As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs and RAG models.
We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable.
arXiv Detail & Related papers (2024-07-01T11:07:23Z)
- MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing large language models (LLMs) by integrating a structured and explicit read-and-write memory module.
Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular.
arXiv Detail & Related papers (2024-04-17T18:13:16Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- RMM: Reinforced Memory Management for Class-Incremental Learning [102.20140790771265]
Class-Incremental Learning (CIL) trains classifiers under a strict memory budget.
Existing methods use a static and ad hoc strategy for memory allocation, which is often sub-optimal.
We propose a dynamic memory management strategy that is optimized for the incremental phases and different object classes.
arXiv Detail & Related papers (2023-01-14T00:07:47Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of activations to reduce memory consumption during training (a minimal illustrative sketch of this idea appears after this list).
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
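As referenced in the Mesa entry above, caching a reduced-precision copy of activations for the backward pass can be sketched with the same saved-tensors-hook mechanism. This is not the authors' Mesa implementation; the helper name and the bfloat16 cast are assumptions made only to illustrate computing the forward pass exactly while saving cheaper activation copies.

```python
# Minimal, hypothetical sketch of low-precision activation caching in the spirit of
# Mesa (not the authors' implementation): the forward pass runs in full precision,
# but the copies saved for the backward pass are kept in bfloat16.
import torch
from torch.autograd.graph import saved_tensors_hooks


def low_precision_activation_cache(cache_dtype: torch.dtype = torch.bfloat16):
    def pack(t: torch.Tensor):
        # Leaf tensors requiring grad are (typically) parameters; keep them intact and
        # only down-cast floating-point activations.
        if t.is_floating_point() and not (t.is_leaf and t.requires_grad):
            return ("cast", t.dtype, t.to(cache_dtype))
        return ("keep", None, t)

    def unpack(packed):
        kind, orig_dtype, t = packed
        # Up-cast back to the original dtype right before the backward pass uses it.
        return t.to(orig_dtype) if kind == "cast" else t

    return saved_tensors_hooks(pack, unpack)


# Usage:
# with low_precision_activation_cache():
#     loss = model(inputs).loss
# loss.backward()
```

Down-casting saved activations trades a small amount of gradient precision for memory; a production scheme would typically compress more aggressively and handle parameters and other non-activation tensors explicitly.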