Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach
- URL: http://arxiv.org/abs/2304.07948v1
- Date: Mon, 17 Apr 2023 02:12:30 GMT
- Title: Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach
- Authors: Siyue Zhang, Minrui Xu, Wei Yang Bryan Lim, and Dusit Niyato
- Abstract summary: Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
- Score: 48.18355658448509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent breakthroughs in generative artificial intelligence have triggered a
surge in demand for machine learning training, which poses significant cost
burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers
unveils the opportunity to optimize the usage of computing capacity powered by
inexpensive and low-carbon energy and address the issue of workload imbalance.
To tackle the challenge of multi-objective scheduling, i.e., maximizing GPU
utilization while reducing operational costs, we propose an algorithm based on
multi-agent reinforcement learning and actor-critic methods to learn the
optimal collaborative scheduling strategy through interacting with a cloud
system built with real-life workload patterns, energy prices, and carbon
intensities. Compared with other algorithms, our proposed method improves the
system utility by up to 28.6% attributable to higher GPU utilization, lower
energy cost, and less carbon emission.
Related papers
- Reinforcement Learning for Adaptive Resource Scheduling in Complex System Environments [8.315191578007857]
This study presents a novel computer system performance optimization and adaptive workload management scheduling algorithm based on Q-learning.
By contrast, Q-learning, a reinforcement learning algorithm, continuously learns from system state changes, enabling dynamic scheduling and resource optimization.
This research provides a foundation for the integration of AI-driven adaptive scheduling in future large-scale systems, offering a scalable, intelligent solution to enhance system performance, reduce operating costs, and support sustainable energy consumption.
arXiv Detail & Related papers (2024-11-08T05:58:09Z) - Game-Theoretic Deep Reinforcement Learning to Minimize Carbon Emissions and Energy Costs for AI Inference Workloads in Geo-Distributed Data Centers [3.3379026542599934]
This work introduces a unique approach combining Game Theory (GT) and Deep Reinforcement Learning (DRL) for optimizing the distribution of AI inference workloads in geo-distributed data centers.
The proposed technique integrates the principles of non-cooperative Game Theory into a DRL framework, enabling data centers to make intelligent decisions regarding workload allocation.
arXiv Detail & Related papers (2024-04-01T20:13:28Z) - DClEVerNet: Deep Combinatorial Learning for Efficient EV Charging
Scheduling in Large-scale Networked Facilities [5.78463306498655]
Electric vehicles (EVs) might stress distribution networks significantly, leaving their performance degraded and jeopardized stability.
Modern power grids require coordinated or smart'' charging strategies capable of optimizing EV charging scheduling in a scalable and efficient fashion.
We formulate a time-coupled binary optimization problem that maximizes EV users' total welfare gain while accounting for the network's available power capacity and stations' occupancy limits.
arXiv Detail & Related papers (2023-05-18T14:03:47Z) - Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitudes faster.
By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z) - Measuring the Carbon Intensity of AI in Cloud Instances [91.28501520271972]
We provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions.
We evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform.
arXiv Detail & Related papers (2022-06-10T17:04:04Z) - HUNTER: AI based Holistic Resource Management for Sustainable Cloud
Computing [26.48962351761643]
We propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER.
The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem.
Experiments on simulated and physical cloud environments show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violation, scheduling time, cost and temperature by up to 12, 35, 43, 54 and 3 percent respectively.
arXiv Detail & Related papers (2021-10-11T18:11:26Z) - Energy-Efficient Multi-Orchestrator Mobile Edge Learning [54.28419430315478]
Mobile Edge Learning (MEL) is a collaborative learning paradigm that features distributed training of Machine Learning (ML) models over edge devices.
In MEL, possible coexistence of multiple learning tasks with different datasets may arise.
We propose lightweight algorithms that can achieve near-optimal performance and facilitate the trade-offs between energy consumption, accuracy, and solution complexity.
arXiv Detail & Related papers (2021-09-02T07:37:10Z) - ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted
resource clusterS [1.798617052102518]
We propose ANDREAS, an advanced scheduling solution to maximize performance and minimize Data Centers operational costs.
experiments show that we can achieve a cost reduction between 30 and 62% on average with respect to first-principle methods.
arXiv Detail & Related papers (2021-05-11T14:36:19Z) - Demand-Side Scheduling Based on Multi-Agent Deep Actor-Critic Learning
for Smart Grids [56.35173057183362]
We consider the problem of demand-side energy management, where each household is equipped with a smart meter that is able to schedule home appliances online.
The goal is to minimize the overall cost under a real-time pricing scheme.
We propose the formulation of a smart grid environment as a Markov game.
arXiv Detail & Related papers (2020-05-05T07:32:40Z) - Risk-Aware Energy Scheduling for Edge Computing with Microgrid: A
Multi-Agent Deep Reinforcement Learning Approach [82.6692222294594]
We study a risk-aware energy scheduling problem for a microgrid-powered MEC network.
We derive the solution by applying a multi-agent deep reinforcement learning (MADRL)-based advantage actor-critic (A3C) algorithm with shared neural networks.
arXiv Detail & Related papers (2020-02-21T02:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.