MTrainS: Improving DLRM training efficiency using heterogeneous memories
- URL: http://arxiv.org/abs/2305.01515v1
- Date: Wed, 19 Apr 2023 06:06:06 GMT
- Title: MTrainS: Improving DLRM training efficiency using heterogeneous memories
- Authors: Hiwot Tadese Kassa, Paul Johnson, Jason Akers, Mrinmoy Ghosh, Andrew
Tulloch, Dheevatsa Mudigere, Jongsoo Park, Xing Liu, Ronald Dreslinski, Ehsan
K. Ardestani
- Abstract summary: In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth.
In this paper, we study the bandwidth requirement and locality of embedding tables in real-world deployed models.
We then design MTrainS, which leverages heterogeneous memory, including byte and block addressable Storage Class Memory for DLRM hierarchically.
- Score: 5.195887979684162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recommendation models are very large, requiring terabytes (TB) of memory
during training. In pursuit of better quality, the model size and complexity
grow over time, which requires additional training data to avoid overfitting.
This model growth demands a large number of resources in data centers. Hence,
training efficiency is becoming considerably more important to keep the data
center power demand manageable. In Deep Learning Recommendation Models (DLRM),
sparse features capturing categorical inputs through embedding tables are the
major contributors to model size and require high memory bandwidth. In this
paper, we study the bandwidth requirement and locality of embedding tables in
real-world deployed models. We observe that the bandwidth requirement is not
uniform across different tables and that embedding tables show high temporal
locality. We then design MTrainS, which leverages heterogeneous memory,
including byte and block addressable Storage Class Memory for DLRM
hierarchically. MTrainS allows for higher memory capacity per node and
increases training efficiency by lowering the need to scale out to multiple
hosts in memory capacity bound use cases. By optimizing the platform memory
hierarchy, we reduce the number of nodes for training by 4-8X, saving power and
cost of training while meeting our target training performance.
Related papers
- Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
arXiv Detail & Related papers (2024-03-25T08:41:01Z) - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z) - Fine-Grained Embedding Dimension Optimization During Training for Recommender Systems [17.602059421895856]
FIITED is a system to automatically reduce the memory footprint via FIne-grained In-Training Embedding Dimension pruning.
We show that FIITED can reduce DLRM embedding size by more than 65% while preserving model quality.
On public datasets, FIITED can reduce the size of embedding tables by 2.1x to 800x with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-09T08:04:11Z) - A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental
Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models do not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z) - Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of computes and memory footprint.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - High-performance, Distributed Training of Large-scale Deep Learning
Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z) - Semantically Constrained Memory Allocation (SCMA) for Embedding in
Efficient Recommendation Systems [27.419109620575313]
A key challenge for deep learning models is to work with millions of categorical classes or tokens.
We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information.
We demonstrate a significant reduction in the memory footprint while maintaining performance.
arXiv Detail & Related papers (2021-02-24T19:55:49Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.