ECRM: Efficient Fault Tolerance for Recommendation Model Training via
Erasure Coding
- URL: http://arxiv.org/abs/2104.01981v1
- Date: Mon, 5 Apr 2021 16:16:19 GMT
- Title: ECRM: Efficient Fault Tolerance for Recommendation Model Training via
Erasure Coding
- Authors: Kaige Liu, Jack Kosaian, K. V. Rashmi
- Abstract summary: Deep-learning recommendation models (DLRMs) are widely deployed to serve personalized content to users.
DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers.
Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead.
- Score: 1.418033127602866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep-learning-based recommendation models (DLRMs) are widely deployed to
serve personalized content to users. DLRMs are large in size due to their use
of large embedding tables, and are trained by distributing the model across the
memory of tens or hundreds of servers. Server failures are common in such large
distributed systems and must be mitigated to enable training to progress.
Checkpointing is the primary approach used for fault tolerance in these
systems, but incurs significant training-time overhead both during normal
operation and when recovering from failures. As these overheads increase with
DLRM size, checkpointing is slated to become an even larger overhead for future
DLRMs, which are expected to grow in size. This calls for rethinking fault
tolerance in DLRM training.
We present ECRM, a DLRM training system that achieves efficient fault
tolerance using erasure coding. ECRM chooses which DLRM parameters to encode,
correctly and efficiently updates parities, and enables training to proceed
without any pauses, while maintaining consistency of the recovered parameters.
We implement ECRM atop XDL, an open-source, industrial-scale DLRM training
system. Compared to checkpointing, ECRM reduces training-time overhead for
large DLRMs by up to 88%, recovers from failures up to 10.3$\times$ faster, and
allows training to proceed during recovery. These results show the promise of
erasure coding in imparting efficient fault tolerance to training current and
future DLRMs.
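The abstract's core idea (encoding parameters into parities, keeping parities consistent under training updates, and rebuilding a failed server's parameters from the survivors) can be illustrated with a minimal single-failure XOR-parity sketch. This is an assumption-laden toy, not ECRM's actual design: function names are hypothetical, parameter shards are modeled as lists of integer words, and real erasure-coded systems would use Reed-Solomon-style codes over larger stripes.

```python
from functools import reduce

def make_parity(shards):
    """XOR parity over k equally sized parameter shards (raw words)."""
    return [reduce(lambda x, y: x ^ y, words) for words in zip(*shards)]

def update_parity(parity, idx, old_word, new_word):
    """Keep the parity consistent after a training step updates one word
    on one shard: XOR out the stale value, XOR in the fresh one."""
    parity[idx] ^= old_word ^ new_word

def recover(surviving_shards, parity):
    """Rebuild the single failed shard from the survivors plus parity."""
    return [reduce(lambda x, y: x ^ y, words)
            for words in zip(parity, *surviving_shards)]

# Three servers each hold one shard of the (toy) embedding parameters.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
parity = make_parity(shards)

# Server 1 fails; its shard is rebuilt from servers 0, 2 and the parity.
rebuilt = recover([shards[0], shards[2]], parity)
```

The `update_parity` step captures why checkpointing-free fault tolerance needs care during training: every parameter update must also adjust the parity, or a later recovery would reconstruct stale, inconsistent values.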
Related papers
- Blockchain-enabled Trustworthy Federated Unlearning [50.01101423318312]
Federated unlearning is a promising paradigm for protecting the data ownership of distributed clients.
Existing works require central servers to retain the historical model parameters from distributed clients.
This paper proposes a new blockchain-enabled trustworthy federated unlearning framework.
arXiv Detail & Related papers (2024-01-29T07:04:48Z) - Fast Machine Unlearning Without Retraining Through Selective Synaptic
Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z) - MTrainS: Improving DLRM training efficiency using heterogeneous memories [5.195887979684162]
In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth.
In this paper, we study the bandwidth requirement and locality of embedding tables in real-world deployed models.
We then design MTrainS, which leverages heterogeneous memory, including byte and block addressable Storage Class Memory for DLRM hierarchically.
arXiv Detail & Related papers (2023-04-19T06:06:06Z) - ERM++: An Improved Baseline for Domain Generalization [69.80606575323691]
We show that Empirical Risk Minimization (ERM) can outperform most existing Domain Generalization (DG) methods.
ERM has achieved such strong results while only tuning hyperparameters such as learning rate, weight decay, batch size, and dropout.
We call the resulting stronger baseline ERM++.
arXiv Detail & Related papers (2023-04-04T17:31:15Z) - DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud [13.996191403653754]
Deep learning recommendation models (DLRMs) rely on large embedding tables to manage sparse features.
Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage.
Tech companies have built extensive cloud-based services to accelerate training DLRM models at scale.
We introduce DLRover-RM, an elastic training framework for DLRM to increase resource utilization and handle the instability of a cloud environment.
arXiv Detail & Related papers (2023-04-04T02:13:46Z) - RecD: Deduplication for End-to-End Deep Learning Recommendation Model
Training Infrastructure [3.991664287163157]
RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets.
We show how RecD improves training throughput, preprocessing throughput, and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
arXiv Detail & Related papers (2022-11-09T22:21:19Z) - FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge
Computing Migrations [55.131858975133085]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z) - The trade-offs of model size in large recommendation models : A 10000
$\times$ compressed criteo-tb DLRM model (100 GB parameters to mere 10MB) [40.623439224839245]
Embedding tables dominate industrial-scale recommendation model sizes, using up to terabytes of memory.
This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models.
We show that scales are tipped towards having a smaller DLRM model, leading to faster inference, easier deployment, and similar training times.
arXiv Detail & Related papers (2022-07-21T19:50:34Z) - Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed
Learning over Directed & Time-Varying Graphs with non-IID Datasets [2.518955020930418]
We propose Sparse-Push, a communication efficient decentralized distributed training algorithm.
The proposed algorithm enables 466x reduction in communication with only 1% degradation in performance.
We demonstrate how communication compression can lead to significant performance degradation in the case of non-IID datasets.
arXiv Detail & Related papers (2021-02-10T19:41:11Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.