DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud
- URL: http://arxiv.org/abs/2304.01468v2
- Date: Fri, 28 Jun 2024 09:17:31 GMT
- Title: DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud
- Authors: Qinlong Wang, Tingfeng Lan, Yinghao Tang, Ziling Huang, Yiheng Du, Haitao Zhang, Jian Sha, Hui Lu, Yuanchun Zhou, Ke Zhang, Mingjie Tang
- Abstract summary: Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features.
Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage.
Tech companies have built extensive cloud-based services to accelerate training DLRM models at scale.
We introduce DLRover-RM, an elastic training framework for DLRM to increase resource utilization and handle the instability of a cloud environment.
- Score: 13.996191403653754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage. Meanwhile, tech companies have built extensive cloud-based services to accelerate training DLRM models at scale. In this paper, we conduct a deep investigation of the DLRM training platforms at AntGroup and reveal two critical challenges: low resource utilization due to suboptimal configurations by users and the tendency to encounter abnormalities due to an unstable cloud environment. To overcome them, we introduce DLRover-RM, an elastic training framework for DLRMs designed to increase resource utilization and handle the instability of a cloud environment. DLRover-RM develops a resource-performance model by considering the unique characteristics of DLRMs and a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs for higher resource utilization. Further, DLRover-RM develops multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs. Our extensive evaluation shows that DLRover-RM reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover-RM has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. DLRover-RM is open-sourced and has been adopted by 10+ companies.
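The abstract does not spell out the three-stage heuristic, so the following is a minimal, hypothetical sketch of what an allocate-profile-adjust loop for elastic DLRM training could look like. The stage logic, thresholds, and the `JobResources` structure are illustrative assumptions, not DLRover-RM's actual API.

```python
# Hypothetical sketch of a three-stage heuristic for elastic resource allocation,
# loosely following the abstract's description. Names and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class JobResources:
    ps_cpu: float      # CPU cores per parameter server
    ps_mem_gb: float   # memory per parameter server (GB)
    num_workers: int   # number of training workers

def three_stage_allocate(res, cpu_util, mem_util, throughput_history):
    """Stage 1: conservative start; Stage 2: right-size parameter servers; Stage 3: scale workers."""
    # Stage 1: with no runtime observations yet, start from a conservative default.
    if not throughput_history:
        return JobResources(ps_cpu=4, ps_mem_gb=16, num_workers=2)

    # Stage 2: right-size parameter-server memory from observed utilization,
    # keeping headroom so hot embedding shards do not run out of memory.
    if mem_util > 0.8:
        res.ps_mem_gb *= 1.5
    elif mem_util < 0.4:
        res.ps_mem_gb = max(8, res.ps_mem_gb * 0.75)

    # Stage 3: add workers while marginal throughput still improves and the
    # parameter servers are not yet the bottleneck.
    if len(throughput_history) >= 2:
        gain = throughput_history[-1] / max(throughput_history[-2], 1e-9) - 1.0
        if gain > 0.05 and cpu_util < 0.7:
            res.num_workers += 1
    return res

# Example: the first call allocates defaults; later calls adjust from live metrics.
res = three_stage_allocate(None, cpu_util=0.0, mem_util=0.0, throughput_history=[])
res = three_stage_allocate(res, cpu_util=0.5, mem_util=0.85, throughput_history=[900, 1000])
```

In practice such a loop would be driven by runtime metrics collected from parameter servers and workers, which is the role the abstract attributes to DLRover-RM's resource-performance model.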
Related papers
- Secure Resource Allocation via Constrained Deep Reinforcement Learning [49.15061461220109]
We present SARMTO, a framework that balances resource allocation, task offloading, security, and performance.
SARMTO consistently outperforms five baseline approaches, achieving up to a 40% reduction in system costs.
These enhancements highlight SARMTO's potential to revolutionize resource management in intricate distributed computing environments.
arXiv Detail & Related papers (2025-01-20T15:52:43Z)
- AI-Driven Resource Allocation Framework for Microservices in Hybrid Cloud Platforms [1.03590082373586]
This paper presents an AI-driven framework for resource allocation among microservices in hybrid cloud platforms.
The framework employs reinforcement learning (RL)-based resource utilization optimization to reduce costs and improve performance.
arXiv Detail & Related papers (2024-12-03T17:41:08Z)
- Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models [60.38983114420845]
We propose dual risk minimization (DRM) to better preserve the core features of downstream tasks.
DRM balances expected performance and worst-case performance, establishing a new state of the art on various real-world benchmarks.
arXiv Detail & Related papers (2024-11-29T15:01:25Z)
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The size of these 1TB+ tables imposes a severe memory bottleneck for the training and inference of recommendation models.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
arXiv Detail & Related papers (2024-10-26T02:33:52Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of these artifacts.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure [3.991664287163157]
RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets.
We show how RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
arXiv Detail & Related papers (2022-11-09T22:21:19Z)
- Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z)
- ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding [1.418033127602866]
Deep-learning recommendation models (DLRMs) are widely deployed to serve personalized content to users.
DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers.
Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead; a minimal sketch of the erasure-coding alternative follows this list.
arXiv Detail & Related papers (2021-04-05T16:16:19Z)
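As referenced above, the snippet below illustrates the erasure-coding idea named in the ECRM title: keep an XOR parity of the embedding-table shards so a single lost shard can be rebuilt without restoring a checkpoint. The sharding layout and XOR parity scheme here are simplifications for illustration, not ECRM's actual design.

```python
# Hypothetical sketch: XOR parity over embedding-table shards as an alternative
# to checkpoint-based recovery. Layout and parity scheme are illustrative only.
import numpy as np

def make_parity(byte_shards):
    """Bitwise-XOR parity over equally shaped uint8 shard views."""
    parity = np.zeros_like(byte_shards[0])
    for s in byte_shards:
        parity ^= s
    return parity

def recover_shard(surviving, parity):
    """Rebuild the missing shard by XOR-ing the parity with all surviving shards."""
    rebuilt = parity.copy()
    for s in surviving:
        rebuilt ^= s
    return rebuilt

# Example: a float32 embedding table split across 4 servers, 1024 x 64 rows each.
shards = [np.random.rand(1024, 64).astype(np.float32) for _ in range(4)]
byte_shards = [s.view(np.uint8) for s in shards]         # XOR operates on raw bytes
parity = make_parity(byte_shards)                        # held on a spare server
restored = recover_shard(byte_shards[:2] + byte_shards[3:], parity)
assert np.array_equal(restored, byte_shards[2])          # lost shard 2 reconstructed
```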