DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud
- URL: http://arxiv.org/abs/2304.01468v2
- Date: Fri, 28 Jun 2024 09:17:31 GMT
- Title: DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud
- Authors: Qinlong Wang, Tingfeng Lan, Yinghao Tang, Ziling Huang, Yiheng Du, Haitao Zhang, Jian Sha, Hui Lu, Yuanchun Zhou, Ke Zhang, Mingjie Tang
- Abstract summary: Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features.
Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage.
Tech companies have built extensive cloud-based services to accelerate DLRM training at scale.
We introduce DLRover-RM, an elastic training framework for DLRMs that increases resource utilization and handles the instability of a cloud environment.
- Score: 13.996191403653754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage. Meanwhile, tech companies have built extensive cloud-based services to accelerate training DLRM models at scale. In this paper, we conduct a deep investigation of the DLRM training platforms at AntGroup and reveal two critical challenges: low resource utilization due to suboptimal configurations by users and the tendency to encounter abnormalities due to an unstable cloud environment. To overcome them, we introduce DLRover-RM, an elastic training framework for DLRMs designed to increase resource utilization and handle the instability of a cloud environment. DLRover-RM develops a resource-performance model by considering the unique characteristics of DLRMs and a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs for higher resource utilization. Further, DLRover-RM develops multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs. Our extensive evaluation shows that DLRover-RM reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover-RM has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. DLRover-RM is open-sourced and has been adopted by 10+ companies.
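The abstract describes DLRover-RM's core loop at a high level: profile a running job, fit a resource-performance model, and elastically adjust the allocation. Below is a minimal Python sketch of that elastic-scaling idea; the names (`JobProfile`, `propose_workers`), the doubling policy, and the 10% gain threshold are illustrative assumptions, not DLRover-RM's actual API or its three-stage strategy.

```python
# Hypothetical sketch of elastic resource adjustment: observe throughput at
# the current allocation and keep scaling up only while the marginal speedup
# justifies it. Not DLRover-RM's real interface.
from dataclasses import dataclass

@dataclass
class JobProfile:
    workers: int            # current number of training workers
    samples_per_sec: float  # observed training throughput

def propose_workers(history: list[JobProfile],
                    max_workers: int = 32,
                    min_gain: float = 0.10) -> int:
    """Double the workers while the last scale-up improved throughput by
    at least min_gain; otherwise hold the current allocation."""
    if len(history) < 2:
        return min(history[-1].workers * 2, max_workers)
    prev, cur = history[-2], history[-1]
    gain = (cur.samples_per_sec - prev.samples_per_sec) / prev.samples_per_sec
    if gain >= min_gain and cur.workers < max_workers:
        return min(cur.workers * 2, max_workers)
    return cur.workers  # diminishing returns: stop growing

# Usage: two profiling observations -> next proposed allocation.
history = [JobProfile(4, 1000.0), JobProfile(8, 1800.0)]
print(propose_workers(history))  # 16: doubling to 8 workers still paid off
```

In practice a scheduler would run this decision inside a monitoring loop and also shrink allocations, but the profile-then-adjust structure is the point here.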
Related papers
- CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments.
We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities.
Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z)
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
Their embedding tables can exceed 1 TB, imposing a severe memory bottleneck for the training and inference of recommendation models.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
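For intuition only (this is not DQRM's actual scheme), the sketch below shows per-row int8 quantization of an embedding table with NumPy: the table shrinks roughly 4x versus float32, and a single float scale per row dequantizes the few rows a batch actually touches.

```python
# Toy per-row int8 quantization of an embedding table; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
table_fp32 = rng.standard_normal((1000, 64)).astype(np.float32)  # toy table

scales = np.abs(table_fp32).max(axis=1, keepdims=True) / 127.0   # per-row scale
table_int8 = np.round(table_fp32 / scales).astype(np.int8)       # ~4x smaller

def lookup(ids: np.ndarray) -> np.ndarray:
    """Dequantize only the embedding rows a batch actually uses."""
    return table_int8[ids].astype(np.float32) * scales[ids]

emb = lookup(np.array([3, 17, 512]))
err = np.abs(emb - table_fp32[[3, 17, 512]]).max()
print(emb.shape, f"max abs error {err:.4f}")  # (3, 64), small rounding error
```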
arXiv Detail & Related papers (2024-10-26T02:33:52Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of these artifacts.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models [3.7414278978078204]
Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems.
The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion.
arXiv Detail & Related papers (2023-08-13T18:28:56Z)
- ERM++: An Improved Baseline for Domain Generalization [69.80606575323691]
We show that Empirical Risk Minimization (ERM) can outperform most existing Domain Generalization (DG) methods.
ERM achieves these strong results while tuning only hyperparameters such as learning rate, weight decay, batch size, and dropout.
We call the resulting stronger baseline ERM++.
arXiv Detail & Related papers (2023-04-04T17:31:15Z)
- RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure [3.991664287163157]
RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets.
We show how RecD improves training throughput, preprocessing throughput, and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
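As a toy illustration of the deduplication intuition (not RecD's implementation), the sketch below assumes consecutive samples repeat the same sparse-feature tuples, so storing each unique tuple once plus a small per-sample index cuts storage and preprocessing work:

```python
# Hypothetical dedup sketch: map repeated sparse-feature tuples to one
# stored copy and per-sample integer indices.
samples = [
    ("user_42", "ad_categoryA"),
    ("user_42", "ad_categoryA"),   # duplicate of the previous tuple
    ("user_7",  "ad_categoryB"),
    ("user_42", "ad_categoryA"),
]

unique: dict[tuple, int] = {}
indices = []                       # one small int per sample, not a full tuple
for features in samples:
    indices.append(unique.setdefault(features, len(unique)))

print(len(unique), indices)        # 2 unique tuples, indices [0, 0, 1, 0]
```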
arXiv Detail & Related papers (2022-11-09T22:21:19Z)
- Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
Meanwhile, reductions in metric performance on the GLUE and SQuAD datasets are only around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z)
- ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding [1.418033127602866]
Deep-learning recommendation models (DLRMs) are widely deployed to serve personalized content to users.
DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers.
Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead.
arXiv Detail & Related papers (2021-04-05T16:16:19Z)
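For a feel of how erasure coding can replace checkpoint-based recovery, here is a minimal XOR-parity sketch (a deliberate simplification; ECRM's actual coding scheme differs): one parity shard lets any single lost embedding shard be rebuilt from the survivors, without rolling the job back to a checkpoint.

```python
# Minimal XOR-parity sketch of erasure-coded fault tolerance; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
# Three servers each hold one embedding shard (toy uint8 payloads).
shards = [rng.integers(0, 256, 8, dtype=np.uint8) for _ in range(3)]
parity = shards[0] ^ shards[1] ^ shards[2]      # kept on a fourth server

lost = 1                                        # server 1 fails mid-training
recovered = parity ^ shards[0] ^ shards[2]      # XOR of the surviving shards
assert np.array_equal(recovered, shards[lost])  # shard rebuilt exactly
print("recovered shard matches original")
```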