ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted
resource clusterS
- URL: http://arxiv.org/abs/2105.05080v1
- Date: Tue, 11 May 2021 14:36:19 GMT
- Title: ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted
resource clusterS
- Authors: Federica Filippini, Danilo Ardagna, Marco Lattuada, Edoardo Amaldi,
Michele Ciavotta, Maciek Riedl, Katarzyna Materka, Paweł Skrzypek,
Fabrizio Magugliani, Marco Cicala
- Abstract summary: We propose ANDREAS, an advanced scheduling solution to maximize performance and minimize data center operational costs.
Experiments show that we can achieve a cost reduction between 30 and 62% on average with respect to first-principle methods.
- Score: 1.798617052102518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently
applied to a wide range of products and solutions. DL training jobs are highly
resource demanding and they experience great benefits when exploiting AI
accelerators (e.g., GPUs). However, the effective management of GPU-powered
clusters comes with great challenges. Among these, efficient scheduling and
resource allocation solutions are crucial to maximize performance and minimize
Data Centers operational costs. In this paper we propose ANDREAS, an advanced
scheduling solution that tackles these problems jointly, aiming at optimizing
DL training runtime workloads and their energy consumption in accelerated
clusters. Experiments based on simulation demonstrate that we can achieve a cost
reduction between 30 and 62% on average with respect to first-principle methods
while the validation on a real cluster shows a worst case deviation below 13%
between actual and predicted costs, proving the effectiveness of the ANDREAS
solution in practical scenarios.
Related papers
- DNN Partitioning, Task Offloading, and Resource Allocation in Dynamic Vehicular Networks: A Lyapunov-Guided Diffusion-Based Reinforcement Learning Approach [49.56404236394601]
We formulate the problem of joint DNN partitioning, task offloading, and resource allocation in Vehicular Edge Computing.
Our objective is to minimize the DNN-based task completion time while guaranteeing the system stability over time.
We propose a Multi-Agent Diffusion-based Deep Reinforcement Learning (MAD2RL) algorithm, incorporating the innovative use of diffusion models.
arXiv Detail & Related papers (2024-06-11T06:31:03Z) - Game-Theoretic Deep Reinforcement Learning to Minimize Carbon Emissions and Energy Costs for AI Inference Workloads in Geo-Distributed Data Centers [3.3379026542599934]
This work introduces a unique approach combining Game Theory (GT) and Deep Reinforcement Learning (DRL) for optimizing the distribution of AI inference workloads in geo-distributed data centers.
The proposed technique integrates the principles of non-cooperative Game Theory into a DRL framework, enabling data centers to make intelligent decisions regarding workload allocation.
arXiv Detail & Related papers (2024-04-01T20:13:28Z) - Snapshot Reinforcement Learning: Leveraging Prior Trajectories for
Efficiency [6.267119107674013]
Deep reinforcement learning (DRL) algorithms require substantial samples and computational resources to achieve higher performance.
We present the Snapshot Reinforcement Learning framework, which enhances sample efficiency by simply altering environments.
We propose a simple and effective SnapshotRL baseline algorithm, S3RL, which integrates well with existing DRL algorithms.
arXiv Detail & Related papers (2024-03-01T17:05:22Z) - Hybrid Reinforcement Learning for Optimizing Pump Sustainability in
Real-World Water Distribution Networks [55.591662978280894]
This article addresses the pump-scheduling optimization problem to enhance real-time control of real-world water distribution networks (WDNs).
Our primary objectives are to adhere to physical operational constraints while reducing energy consumption and operational costs.
Traditional optimization techniques, such as evolution-based and genetic algorithms, often fall short due to their lack of convergence guarantees.
arXiv Detail & Related papers (2023-10-13T21:26:16Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical
Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z) - Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitude faster.
By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z) - Job Scheduling in Datacenters using Constraint Controlled RL [0.0]
We apply Proportional-Integral-Derivative (PID) Lagrangian methods in Deep Reinforcement Learning to the job scheduling problem in a green datacenter environment.
Experiments demonstrate improved performance compared to scheduling policies without the PID Lagrangian methods.
arXiv Detail & Related papers (2022-11-10T04:43:14Z) - A Distributed Deep Reinforcement Learning Technique for Application
Placement in Edge and Fog Computing Environments [31.326505188936746]
Several Deep Reinforcement Learning (DRL)-based placement techniques have been proposed in fog/edge computing environments.
We propose an actor-critic-based distributed application placement technique based on the IMPortance weighted Actor-Learner Architectures (IMPALA).
arXiv Detail & Related papers (2021-10-24T11:25:03Z) - Combining Deep Learning and Optimization for Security-Constrained
Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of APR within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.