ECRM: Efficient Fault Tolerance for Recommendation Model Training via
Erasure Coding
- URL: http://arxiv.org/abs/2104.01981v1
- Date: Mon, 5 Apr 2021 16:16:19 GMT
- Title: ECRM: Efficient Fault Tolerance for Recommendation Model Training via
Erasure Coding
- Authors: Kaige Liu, Jack Kosaian, K. V. Rashmi
- Abstract summary: Deep-learning recommendation models (DLRMs) are widely deployed to serve personalized content to users.
DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers.
Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead.
- Score: 1.418033127602866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep-learning-based recommendation models (DLRMs) are widely deployed to
serve personalized content to users. DLRMs are large in size due to their use
of large embedding tables, and are trained by distributing the model across the
memory of tens or hundreds of servers. Server failures are common in such large
distributed systems and must be mitigated to enable training to progress.
Checkpointing is the primary approach used for fault tolerance in these
systems, but incurs significant training-time overhead both during normal
operation and when recovering from failures. As these overheads increase with
DLRM size, checkpointing is slated to become an even larger overhead for future
DLRMs, which are expected to grow in size. This calls for rethinking fault
tolerance in DLRM training.
We present ECRM, a DLRM training system that achieves efficient fault
tolerance using erasure coding. ECRM chooses which DLRM parameters to encode,
correctly and efficiently updates parities, and enables training to proceed
without any pauses, while maintaining consistency of the recovered parameters.
We implement ECRM atop XDL, an open-source, industrial-scale DLRM training
system. Compared to checkpointing, ECRM reduces training-time overhead for
large DLRMs by up to 88%, recovers from failures up to 10.3$\times$ faster, and
allows training to proceed during recovery. These results show the promise of
erasure coding in imparting efficient fault tolerance to training current and
future DLRMs.
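The abstract's core idea (encoding parameters into parities, keeping parities consistent under training updates, and rebuilding a failed server's parameters from the survivors) can be illustrated with a minimal single-failure XOR-parity sketch. This is an assumption-laden toy, not ECRM's actual design: function names are hypothetical, parameter shards are modeled as lists of integer words, and real erasure-coded systems would use Reed-Solomon-style codes over larger stripes.

```python
from functools import reduce

def make_parity(shards):
    """XOR parity over k equally sized parameter shards (raw words)."""
    return [reduce(lambda x, y: x ^ y, words) for words in zip(*shards)]

def update_parity(parity, idx, old_word, new_word):
    """Keep the parity consistent after a training step updates one word
    on one shard: XOR out the stale value, XOR in the fresh one."""
    parity[idx] ^= old_word ^ new_word

def recover(surviving_shards, parity):
    """Rebuild the single failed shard from the survivors plus parity."""
    return [reduce(lambda x, y: x ^ y, words)
            for words in zip(parity, *surviving_shards)]

# Three servers each hold one shard of the (toy) embedding parameters.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
parity = make_parity(shards)

# Server 1 fails; its shard is rebuilt from servers 0, 2 and the parity.
rebuilt = recover([shards[0], shards[2]], parity)
```

The `update_parity` step captures why checkpointing-free fault tolerance needs care during training: every parameter update must also adjust the parity, or a later recovery would reconstruct stale, inconsistent values.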
Related papers
- Blockchain-enabled Trustworthy Federated Unlearning [50.01101423318312]
Federated unlearning is a promising paradigm for protecting the data ownership of distributed clients.
Existing works require central servers to retain the historical model parameters from distributed clients.
This paper proposes a new blockchain-enabled trustworthy federated unlearning framework.
arXiv Detail & Related papers (2024-01-29T07:04:48Z) - Fast Machine Unlearning Without Retraining Through Selective Synaptic
Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z) - MTrainS: Improving DLRM training efficiency using heterogeneous memories [5.195887979684162]
In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth.
In this paper, we study the bandwidth requirement and locality of embedding tables in real-world deployed models.
We then design MTrainS, which leverages heterogeneous memory, including byte and block addressable Storage Class Memory for DLRM hierarchically.
arXiv Detail & Related papers (2023-04-19T06:06:06Z) - ERM++: An Improved Baseline for Domain Generalization [69.80606575323691]
We show that Empirical Risk Minimization (ERM) can outperform most existing Domain Generalization (DG) methods.
ERM has achieved such strong results while only tuning hyperparameters such as learning rate, weight decay, batch size, and dropout.
We call the resulting stronger baseline ERM++.
arXiv Detail & Related papers (2023-04-04T17:31:15Z) - DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud [13.996191403653754]
Deep learning recommendation models (DLRMs) rely on large embedding tables to manage sparse features.
Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage.
Tech companies have built extensive cloud-based services to accelerate training DLRM models at scale.
We introduce DLRover-RM, an elastic training framework for DLRM to increase resource utilization and handle the instability of a cloud environment.
arXiv Detail & Related papers (2023-04-04T02:13:46Z) - RecD: Deduplication for End-to-End Deep Learning Recommendation Model
Training Infrastructure [3.991664287163157]
RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets.
We show how RecD improves training throughput, preprocessing throughput, and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
arXiv Detail & Related papers (2022-11-09T22:21:19Z) - FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge
Computing Migrations [55.131858975133085]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z) - The trade-offs of model size in large recommendation models : A 10000
$\times$ compressed criteo-tb DLRM model (100 GB parameters to mere 10MB) [40.623439224839245]
Embedding tables dominate industrial-scale recommendation model sizes, using up to terabytes of memory.
This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models.
We show that scales are tipped towards having a smaller DLRM model, leading to faster inference, easier deployment, and similar training times.
arXiv Detail & Related papers (2022-07-21T19:50:34Z) - Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed
Learning over Directed & Time-Varying Graphs with non-IID Datasets [2.518955020930418]
We propose Sparse-Push, a communication efficient decentralized distributed training algorithm.
The proposed algorithm enables 466x reduction in communication with only 1% degradation in performance.
We demonstrate how communication compression can lead to significant performance degradation in the case of non-IID datasets.
arXiv Detail & Related papers (2021-02-10T19:41:11Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.