DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud
- URL: http://arxiv.org/abs/2304.01468v2
- Date: Fri, 28 Jun 2024 09:17:31 GMT
- Title: DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud
- Authors: Qinlong Wang, Tingfeng Lan, Yinghao Tang, Ziling Huang, Yiheng Du, Haitao Zhang, Jian Sha, Hui Lu, Yuanchun Zhou, Ke Zhang, Mingjie Tang
- Abstract summary: Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features.
Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage.
Tech companies have built extensive cloud-based services to accelerate training DLRM models at scale.
We introduce DLRover-RM, an elastic training framework for DLRM to increase resource utilization and handle the instability of a cloud environment.
- Score: 13.996191403653754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage. Meanwhile, tech companies have built extensive cloud-based services to accelerate training DLRM models at scale. In this paper, we conduct a deep investigation of the DLRM training platforms at AntGroup and reveal two critical challenges: low resource utilization due to suboptimal configurations by users and the tendency to encounter abnormalities due to an unstable cloud environment. To overcome them, we introduce DLRover-RM, an elastic training framework for DLRMs designed to increase resource utilization and handle the instability of a cloud environment. DLRover-RM develops a resource-performance model by considering the unique characteristics of DLRMs and a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs for higher resource utilization. Further, DLRover-RM develops multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs. Our extensive evaluation shows that DLRover-RM reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover-RM has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. DLRover-RM is open-sourced and has been adopted by 10+ companies.
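The abstract does not spell out the three-stage heuristic, so the following is a minimal, hypothetical sketch of what an allocate-profile-adjust loop for elastic DLRM training could look like. The stage logic, thresholds, and the `JobResources` structure are illustrative assumptions, not DLRover-RM's actual API.

```python
# Hypothetical sketch of a three-stage heuristic for elastic resource allocation,
# loosely following the abstract's description. Names and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class JobResources:
    ps_cpu: float      # CPU cores per parameter server
    ps_mem_gb: float   # memory per parameter server (GB)
    num_workers: int   # number of training workers

def three_stage_allocate(res, cpu_util, mem_util, throughput_history):
    """Stage 1: conservative start; Stage 2: right-size parameter servers; Stage 3: scale workers."""
    # Stage 1: with no runtime observations yet, start from a conservative default.
    if not throughput_history:
        return JobResources(ps_cpu=4, ps_mem_gb=16, num_workers=2)

    # Stage 2: right-size parameter-server memory from observed utilization,
    # keeping headroom so hot embedding shards do not run out of memory.
    if mem_util > 0.8:
        res.ps_mem_gb *= 1.5
    elif mem_util < 0.4:
        res.ps_mem_gb = max(8, res.ps_mem_gb * 0.75)

    # Stage 3: add workers while marginal throughput still improves and the
    # parameter servers are not yet the bottleneck.
    if len(throughput_history) >= 2:
        gain = throughput_history[-1] / max(throughput_history[-2], 1e-9) - 1.0
        if gain > 0.05 and cpu_util < 0.7:
            res.num_workers += 1
    return res

# Example: the first call allocates defaults; later calls adjust from live metrics.
res = three_stage_allocate(None, cpu_util=0.0, mem_util=0.0, throughput_history=[])
res = three_stage_allocate(res, cpu_util=0.5, mem_util=0.85, throughput_history=[900, 1000])
```

In practice such a loop would be driven by runtime metrics collected from parameter servers and workers, which is the role the abstract attributes to DLRover-RM's resource-performance model.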
Related papers
- Secure Resource Allocation via Constrained Deep Reinforcement Learning [49.15061461220109]
We present SARMTO, a framework that balances resource allocation, task offloading, security, and performance.
SARMTO consistently outperforms five baseline approaches, achieving up to a 40% reduction in system costs.
These enhancements highlight SARMTO's potential to revolutionize resource management in intricate distributed computing environments.
arXiv Detail & Related papers (2025-01-20T15:52:43Z)
- AI-Driven Resource Allocation Framework for Microservices in Hybrid Cloud Platforms [1.03590082373586]
This paper presents an AI-driven framework for resource allocation among microservices in hybrid cloud platforms.
The framework employs reinforcement learning (RL)-based resource utilization optimization to reduce costs and improve performance.
arXiv Detail & Related papers (2024-12-03T17:41:08Z)
- Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models [60.38983114420845]
We propose dual risk minimization (DRM) to better preserve the core features of downstream tasks.
DRM balances expected performance and worst-case performance, establishing a new state of the art on various real-world benchmarks.
arXiv Detail & Related papers (2024-11-29T15:01:25Z)
- DQRM: Deep Quantized Recommendation Models [34.73674946187648]
Large-scale recommendation models are the dominant workload for many large Internet companies.
The size of these 1TB+ tables imposes a severe memory bottleneck for the training and inference of recommendation models.
We propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM).
arXiv Detail & Related papers (2024-10-26T02:33:52Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of these artifacts.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure [3.991664287163157]
RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets.
We show how RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
arXiv Detail & Related papers (2022-11-09T22:21:19Z)
- Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z)
- ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding [1.418033127602866]
Deep-learning recommendation models (DLRMs) are widely deployed to serve personalized content to users.
DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers.
Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant training-time overhead; a minimal sketch of the erasure-coding alternative follows this list.
arXiv Detail & Related papers (2021-04-05T16:16:19Z)
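As referenced above, the snippet below illustrates the erasure-coding idea named in the ECRM title: keep an XOR parity of the embedding-table shards so a single lost shard can be rebuilt without restoring a checkpoint. The sharding layout and XOR parity scheme here are simplifications for illustration, not ECRM's actual design.

```python
# Hypothetical sketch: XOR parity over embedding-table shards as an alternative
# to checkpoint-based recovery. Layout and parity scheme are illustrative only.
import numpy as np

def make_parity(byte_shards):
    """Bitwise-XOR parity over equally shaped uint8 shard views."""
    parity = np.zeros_like(byte_shards[0])
    for s in byte_shards:
        parity ^= s
    return parity

def recover_shard(surviving, parity):
    """Rebuild the missing shard by XOR-ing the parity with all surviving shards."""
    rebuilt = parity.copy()
    for s in surviving:
        rebuilt ^= s
    return rebuilt

# Example: a float32 embedding table split across 4 servers, 1024 x 64 rows each.
shards = [np.random.rand(1024, 64).astype(np.float32) for _ in range(4)]
byte_shards = [s.view(np.uint8) for s in shards]         # XOR operates on raw bytes
parity = make_parity(byte_shards)                        # held on a spare server
restored = recover_shard(byte_shards[:2] + byte_shards[3:], parity)
assert np.array_equal(restored, byte_shards[2])          # lost shard 2 reconstructed
```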