MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- URL: http://arxiv.org/abs/2407.12117v3
- Date: Wed, 15 Jan 2025 08:03:55 GMT
- Title: MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- Authors: Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui
- Abstract summary: Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications.
We propose MEMO, a novel framework for fine-grained activation memory management.
MEMO achieves an average of 1.97x and 1.80x the MFU of Megatron-LM and DeepSpeed, respectively.
- Score: 24.066283519769968
- License:
- Abstract: Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelisms. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing memory reuse across transformer layers. Empirical results demonstrate that MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays associated with the memory reorganization process due to fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates efficient training of 7B LLM with 1 million sequence length on just 8 A800 GPUs, achieving an MFU of 52.30%.
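The layer-wise offload-and-prefetch idea described in the abstract can be illustrated with a short PyTorch sketch. This is not MEMO's implementation; the helper name `cpu_offload_hooks` and the size threshold are assumptions for illustration. The sketch uses the public `torch.autograd.graph.saved_tensors_hooks` API to copy large saved activations to pinned CPU memory during the forward pass and fetch them back for the backward pass.

```python
# Minimal, hypothetical sketch of layer-wise activation offloading (not MEMO's code).
# Large activations saved for backward are copied to pinned CPU memory during the
# forward pass and fetched back to the GPU when the backward pass needs them.
import torch
from torch.autograd.graph import saved_tensors_hooks


def cpu_offload_hooks(device: str = "cuda", min_numel: int = 1 << 20):
    """Offload saved tensors larger than `min_numel` elements to pinned CPU memory."""

    def pack(t: torch.Tensor):
        if not t.is_cuda or t.numel() < min_numel:
            return ("keep", t)  # small or CPU-resident tensors stay where they are
        buf = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
        buf.copy_(t)  # a real system would overlap this copy with compute on a side stream
        return ("offloaded", buf)

    def unpack(packed):
        kind, t = packed
        # Fetch offloaded activations back to the GPU right before backward uses them.
        return t.to(device, non_blocking=True) if kind == "offloaded" else t

    return saved_tensors_hooks(pack, unpack)


# Usage: wrap each transformer layer's forward pass so that its saved activations
# live in CPU memory until the backward pass touches them.
# with cpu_offload_hooks():
#     hidden_states = layer(hidden_states)
```

MEMO's actual mechanism is finer-grained: it swaps and recomputes activations token-wise to overlap transfers with computation and plans memory reuse across layers with a bi-level MIP, none of which is captured by this minimal sketch.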
Related papers
- MoM: Linear Sequence Modeling with Mixture-of-Memories [9.665802842933209]
We introduce a novel architecture called Mixture-of-Memories (MoM).
MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states.
MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques.
arXiv Detail & Related papers (2025-02-19T12:53:55Z)
- Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training [45.225732322141994]
Large language models (LLMs) have impressive performance across a range of natural language processing tasks.
Their vast number of parameters introduces significant memory challenges during training.
Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing.
We propose a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements.
arXiv Detail & Related papers (2025-01-13T11:35:09Z)
- CompAct: Compressed Activations for Memory-Efficient LLM Training [7.837209773889032]
CompAct is a technique that reduces peak memory utilization on GPU by 25-30% for pretraining and 50% for fine-tuning of LLMs.
By storing low-rank, compressed activations to be used in the backward pass, we greatly reduce the required memory.
We expect CompAct's savings to scale even higher for larger models.
arXiv Detail & Related papers (2024-10-20T10:24:38Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- $\text{Memory}^3$: Language Modeling with Explicit Memory [22.572376536612015]
We equip large language models (LLMs) with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG).
As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs and RAG models.
We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable.
arXiv Detail & Related papers (2024-07-01T11:07:23Z)
- MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing large language models (LLMs) by integrating a structured and explicit read-and-write memory module.
Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular.
arXiv Detail & Related papers (2024-04-17T18:13:16Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- RMM: Reinforced Memory Management for Class-Incremental Learning [102.20140790771265]
Class-Incremental Learning (CIL) trains classifiers under a strict memory budget.
Existing methods use a static and ad hoc strategy for memory allocation, which is often sub-optimal.
We propose a dynamic memory management strategy that is optimized for the incremental phases and different object classes.
arXiv Detail & Related papers (2023-01-14T00:07:47Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of activations to reduce memory consumption during training (a minimal illustrative sketch of this idea appears after this list).
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
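As referenced in the Mesa entry above, caching a reduced-precision copy of activations for the backward pass can be sketched with the same saved-tensors-hook mechanism. This is not the authors' Mesa implementation; the helper name and the bfloat16 cast are assumptions made only to illustrate computing the forward pass exactly while saving cheaper activation copies.

```python
# Minimal, hypothetical sketch of low-precision activation caching in the spirit of
# Mesa (not the authors' implementation): the forward pass runs in full precision,
# but the copies saved for the backward pass are kept in bfloat16.
import torch
from torch.autograd.graph import saved_tensors_hooks


def low_precision_activation_cache(cache_dtype: torch.dtype = torch.bfloat16):
    def pack(t: torch.Tensor):
        # Leaf tensors requiring grad are (typically) parameters; keep them intact and
        # only down-cast floating-point activations.
        if t.is_floating_point() and not (t.is_leaf and t.requires_grad):
            return ("cast", t.dtype, t.to(cache_dtype))
        return ("keep", None, t)

    def unpack(packed):
        kind, orig_dtype, t = packed
        # Up-cast back to the original dtype right before the backward pass uses it.
        return t.to(orig_dtype) if kind == "cast" else t

    return saved_tensors_hooks(pack, unpack)


# Usage:
# with low_precision_activation_cache():
#     loss = model(inputs).loss
# loss.backward()
```

Down-casting saved activations trades a small amount of gradient precision for memory; a production scheme would typically compress more aggressively and handle parameters and other non-activation tensors explicitly.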