ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS
- URL: http://arxiv.org/abs/2105.05080v1
- Date: Tue, 11 May 2021 14:36:19 GMT
- Title: ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS
- Authors: Federica Filippini, Danilo Ardagna, Marco Lattuada, Edoardo Amaldi,
Michele Ciavotta, Maciek Riedl, Katarzyna Materka, Paweł Skrzypek,
Fabrizio Magugliani, Marco Cicala
- Abstract summary: We propose ANDREAS, an advanced scheduling solution to maximize performance and minimize data center operational costs.
Experiments show that we can achieve a cost reduction between 30% and 62% on average with respect to first-principle methods.
- Score: 1.798617052102518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently
applied to a wide range of products and solutions. DL training jobs are highly
resource demanding, and they benefit greatly from AI accelerators (e.g., GPUs).
However, the effective management of GPU-powered clusters comes with great
challenges. Among these, efficient scheduling and resource allocation solutions
are crucial to maximize performance and minimize data centers' operational
costs. In this paper we propose ANDREAS, an advanced scheduling solution that
tackles these problems jointly, aiming at optimizing DL training workloads'
runtime and their energy consumption in accelerated clusters. Simulation-based
experiments demonstrate that we can achieve a cost reduction between 30% and
62% on average with respect to first-principle methods, while validation on a
real cluster shows a worst-case deviation below 13% between actual and
predicted costs, proving the effectiveness of the ANDREAS solution in practical
scenarios.
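The abstract frames ANDREAS as a joint scheduling and resource-allocation optimizer for GPU clusters. As a rough illustration of the kind of cost-aware placement decision such a scheduler makes, here is a minimal Python sketch of a greedy heuristic; the Job and Node models, the cost formula, and all numbers are hypothetical stand-ins, not the formulation from the paper.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    # Estimated runtime (hours) for each candidate GPU count; hypothetical model.
    runtime_by_gpus: dict  # {num_gpus: hours}

@dataclass
class Node:
    name: str
    free_gpus: int
    cost_per_gpu_hour: float  # energy plus amortized hardware cost (assumed)

def schedule_greedy(jobs, nodes):
    """Greedily place each job on the (node, gpu_count) pair with the lowest
    estimated monetary cost. A toy stand-in for a joint scheduling and
    allocation optimizer, not the ANDREAS algorithm itself."""
    placements = []
    for job in sorted(jobs, key=lambda j: min(j.runtime_by_gpus.values())):
        best = None  # (cost, node, gpu_count)
        for node in nodes:
            for gpus, hours in job.runtime_by_gpus.items():
                if gpus <= node.free_gpus:
                    cost = hours * gpus * node.cost_per_gpu_hour
                    if best is None or cost < best[0]:
                        best = (cost, node, gpus)
        if best is None:
            continue  # no capacity now; a real scheduler would queue the job
        cost, node, gpus = best
        node.free_gpus -= gpus
        placements.append((job.name, node.name, gpus, cost))
    return placements

if __name__ == "__main__":
    jobs = [Job("resnet", {1: 10.0, 2: 6.0}), Job("bert", {2: 8.0, 4: 5.0})]
    nodes = [Node("a100-box", 4, 2.5), Node("v100-box", 4, 1.2)]
    for name, node, gpus, cost in schedule_greedy(jobs, nodes):
        print(f"{name} -> {node} x{gpus} GPUs, est. cost ${cost:.2f}")
```

A scheduler of this kind would also preempt and re-queue jobs as the workload changes; the greedy pass above only captures the per-job cost comparison.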
Related papers
- Scalable Machine Learning Training Infrastructure for Online Ads Recommendation and Auction Scoring Modeling at Google [4.0088714133342895]
Ads recommendation and auction scoring models at Google scale demand immense computational resources.
This paper proposes solutions for three critical challenges that must be addressed for efficient end-to-end execution.
arXiv Detail & Related papers (2025-01-17T20:40:56Z)
- Learning for Cross-Layer Resource Allocation in MEC-Aided Cell-Free Networks [71.30914500714262]
Cross-layer resource allocation over mobile edge computing (MEC)-aided cell-free networks can fully exploit the transmission and computing resources to improve the data rate.
Joint subcarrier allocation and beamforming optimization are investigated for the MEC-aided cell-free network from the perspective of deep learning.
arXiv Detail & Related papers (2024-12-21T10:18:55Z)
- Game-Theoretic Deep Reinforcement Learning to Minimize Carbon Emissions and Energy Costs for AI Inference Workloads in Geo-Distributed Data Centers [3.3379026542599934]
This work introduces an approach combining Game Theory (GT) and Deep Reinforcement Learning (DRL) for optimizing the distribution of AI inference workloads in geo-distributed data centers.
The proposed technique integrates the principles of non-cooperative Game Theory into a DRL framework, enabling data centers to make intelligent decisions regarding workload allocation.
arXiv Detail & Related papers (2024-04-01T20:13:28Z)
- Hybrid Reinforcement Learning for Optimizing Pump Sustainability in Real-World Water Distribution Networks [55.591662978280894]
This article addresses the pump-scheduling optimization problem to enhance real-time control of real-world water distribution networks (WDNs).
Its primary objectives are to adhere to physical operational constraints while reducing energy consumption and operational costs.
Traditional optimization techniques, such as evolution-based and genetic algorithms, often fall short due to their lack of convergence guarantees.
arXiv Detail & Related papers (2023-10-13T21:26:16Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
The authors propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
The authors propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interaction with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
The authors show that it is possible to train agents in complex real-world environments orders of magnitude faster.
By enabling the application of reinforcement learning methods to new domains, they show that interesting and non-trivial solutions can be found.
arXiv Detail & Related papers (2022-11-23T19:17:20Z)
- Job Scheduling in Datacenters using Constraint Controlled RL [0.0]
The authors apply Proportional-Integral-Derivative (PID) Lagrangian methods in deep reinforcement learning to the job scheduling problem in a green datacenter environment (a minimal sketch of the PID-controlled multiplier update appears after this list).
Experiments demonstrate improved performance compared to scheduling policies without the PID Lagrangian methods.
arXiv Detail & Related papers (2022-11-10T04:43:14Z)
- A Distributed Deep Reinforcement Learning Technique for Application Placement in Edge and Fog Computing Environments [31.326505188936746]
Several Deep Reinforcement Learning (DRL)-based placement techniques have been proposed for fog/edge computing environments.
The authors propose an actor-critic-based distributed application placement technique built on the IMPortance weighted Actor-Learner Architectures (IMPALA).
arXiv Detail & Related papers (2021-10-24T11:25:03Z)
- Combining Deep Learning and Optimization for Security-Constrained Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of APR within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
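The entry "Job Scheduling in Datacenters using Constraint Controlled RL" above applies PID Lagrangian methods from constrained reinforcement learning. The core idea (due to Stooke et al.) is to drive the Lagrange multiplier of a constraint with a PID controller on the constraint violation instead of plain gradient ascent. Below is a minimal sketch; the gains, the cost limit, and the episode cost are illustrative values, not taken from the paper.

```python
class PIDLagrangian:
    """PID controller on a constraint violation, producing the Lagrange
    multiplier for a constrained RL objective. Gains are illustrative."""

    def __init__(self, kp=0.1, ki=0.01, kd=0.05, limit=100.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.limit = limit          # cap on the multiplier
        self.integral = 0.0
        self.prev_violation = 0.0

    def update(self, episode_cost, cost_limit):
        # Violation is positive when the constraint (e.g. a power budget)
        # is exceeded; the multiplier then grows to penalize the policy.
        violation = episode_cost - cost_limit
        self.integral = max(0.0, self.integral + violation)
        derivative = max(0.0, violation - self.prev_violation)
        self.prev_violation = violation
        lam = self.kp * violation + self.ki * self.integral + self.kd * derivative
        return min(max(0.0, lam), self.limit)

# The multiplier weights the cost term in the policy loss:
#   loss = -reward_objective + lam * cost_objective
pid = PIDLagrangian()
lam = pid.update(episode_cost=120.0, cost_limit=100.0)  # hypothetical numbers
```

Compared with pure integral (gradient-ascent) updates, the proportional and derivative terms react to the current violation and its trend, which is what gives the method its faster, less oscillatory constraint enforcement.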