Related papers: SustainDC: Benchmarking for Sustainable Data Center Control

SustainDC: Benchmarking for Sustainable Data Center Control

URL: http://arxiv.org/abs/2408.07841v4
Date: Sat, 19 Oct 2024 07:20:57 GMT
Title: SustainDC: Benchmarking for Sustainable Data Center Control
Authors: Avisek Naug, Antonio Guillen, Ricardo Luna, Vineet Gundecha, Desik Rengarajan, Sahand Ghorbanpour, Sajad Mousavi, Ashwin Ramesh Babu, Dejan Markovikj, Lekhapriya D Kashyap, Soumyendu Sarkar,
Abstract summary: We introduce SustainDC, a set of Python environments for benchmarking multi-agent reinforcement learning (MARL) algorithms for data centers (DC) SustainDC supports custom DC configurations and tasks such as workload scheduling, cooling optimization, and auxiliary battery management. We evaluate various MARL algorithms on SustainDC, showing their performance across diverse DC designs, locations, weather conditions, grid carbon intensity, and workload requirements.
Score: 4.159959816797259
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Machine learning has driven an exponential increase in computational demand, leading to massive data centers that consume significant amounts of energy and contribute to climate change. This makes sustainable data center control a priority. In this paper, we introduce SustainDC, a set of Python environments for benchmarking multi-agent reinforcement learning (MARL) algorithms for data centers (DC). SustainDC supports custom DC configurations and tasks such as workload scheduling, cooling optimization, and auxiliary battery management, with multiple agents managing these operations while accounting for the effects of each other. We evaluate various MARL algorithms on SustainDC, showing their performance across diverse DC designs, locations, weather conditions, grid carbon intensity, and workload requirements. Our results highlight significant opportunities for improvement of data center operations using MARL algorithms. Given the increasing use of DC due to AI, SustainDC provides a crucial platform for the development and benchmarking of advanced algorithms essential for achieving sustainable computing and addressing other heterogeneous real-world challenges.

Related papers

Transforming Future Data Center Operations and Management via Physical AI [24.063748316223343]
Data centers as mission-critical infrastructures are pivotal in powering the growth of artificial intelligence (AI) and the digital economy. The evolution from Internet DC to AI DC has introduced new challenges in operating and managing data centers for improved business resilience and reduced total cost of ownership. We propose and develop a novel Physical AI (PhyAI) framework for advancing DC operations and management.
arXiv Detail & Related papers (2025-04-07T12:09:22Z)
Hierarchical Multi-Agent Framework for Carbon-Efficient Liquid-Cooled Data Center Clusters [5.335496791443277]
This paper introduces Green-DCC, which proposes a Reinforcement Learning (RL) based hierarchical controller to optimize both workload and liquid cooling dynamically in a DCC. We demonstrate how the system optimize multiple data centers synchronously, enabling the scope of digital twins, and compare the performance of various RL approaches based on carbon emissions and sustainability metrics.
arXiv Detail & Related papers (2025-02-12T12:00:58Z)
Beyond Efficiency: Scaling AI Sustainably [4.711003829305544]
Modern AI applications have driven ever-increasing demands in computing. This paper characterizes the carbon impact of AI, including both operational carbon emissions from training and inference as well as embodied carbon emissions from hardware manufacturing.
arXiv Detail & Related papers (2024-06-08T00:07:16Z)
Sustainability of Data Center Digital Twins with Reinforcement Learning [2.4971633082970377]
Machine learning (ML) has led to an increased demand for computational power, resulting in larger data centers (DCs) and higher energy consumption. To address this issue and reduce carbon emissions, intelligent design and control of DC components such as IT servers, cabinets, HVAC cooling, flexible load shifting, and battery energy storage are essential. DCRL-Green is a multi-agent RL environment that empowers the ML community to design data centers and research, develop, and refine RL controllers for carbon footprint reduction in DCs.
arXiv Detail & Related papers (2024-04-16T18:22:30Z)
Carbon Footprint Reduction for Sustainable Data Centers in Real-Time [2.794742330785396]
We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework to optimize data centers for the objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real-time for various locations under real-world dynamic weather and grid carbon intensity conditions.
arXiv Detail & Related papers (2024-03-21T02:59:56Z)
CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning [62.58375643251612]
We propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand feature space using adaptive temporal distances for task-agnostic data collection. With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can navigate itself to collect higher-quality data with curiosity. Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind control suite.
arXiv Detail & Related papers (2023-12-19T14:26:23Z)
MOSAIC: A Multi-Objective Optimization Framework for Sustainable Datacenter Management [2.9699290794642366]
We propose a novel framework for sustainable datacenter management. We take into account multiple geography- and time-based factors including renewable energy sources, variable energy costs, power usage efficiency, carbon factors, and water intensity in energy. Our framework achieves a cumulative improvement across all objectives (carbon, water, cost) of up to 4.61x compared to the state-of-the-arts.
arXiv Detail & Related papers (2023-11-14T23:05:37Z)
Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption. Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy. We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems [81.21462458089142]
Data-centric AI is emerging as a unifying paradigm that could enable reliable end-to-end pipelines. We propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations. This data-centric lens on development aims to promote thoughtfulness and transparency prior to system development.
arXiv Detail & Related papers (2022-11-09T17:32:09Z)
Data augmentation through multivariate scenario forecasting in Data Centers using Generative Adversarial Networks [0.18416014644193063]
The main challenge in achieving a global energy efficiency strategy based on Artificial Intelligence is that we need massive amounts of data to feed the algorithms. This paper proposes a time-series data augmentation methodology based on synthetic scenario forecasting within the Data Center. Our research will help to optimize the energy consumed in Data Centers, although the proposed methodology can be employed in any similar time-series-like problem.
arXiv Detail & Related papers (2022-01-12T15:09:10Z)
Power Modeling for Effective Datacenter Planning and Compute Management [53.41102502425513]
We discuss two classes of statistical power models designed and validated to be accurate, simple, interpretable and applicable to all hardware configurations and workloads. We demonstrate that the proposed statistical modeling techniques, while simple and scalable, predict power with less than 5% Mean Absolute Percent Error (MAPE) for more than 95% diverse Power Distribution Units (more than 2000) using only 4 features.
arXiv Detail & Related papers (2021-03-22T21:22:51Z)
Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput. Until today, no such learning-based algorithms have shown practical potential in this domain. We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as MAML or Dif-MAML. We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML. Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.