DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads
- URL: http://arxiv.org/abs/2511.00117v1
- Date: Fri, 31 Oct 2025 03:07:12 GMT
- Title: DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads
- Authors: Antonio Guillen-Perez, Avisek Naug, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Ashwin Ramesh Babu, Munther Salim, Shubhanker Banerjee, Eoin H. Oude Essink, Damien Fay, Soumyendu Sarkar,
- Abstract summary: We present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling.<n>It combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters.<n>A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use.
- Score: 14.834687262955585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing energy demands and carbon footprint of large-scale AI require intelligent workload management in globally distributed data centers. Yet progress is limited by the absence of benchmarks that realistically capture the interplay of time-varying environmental factors (grid carbon intensity, electricity prices, weather), detailed data center physics (CPUs, GPUs, memory, HVAC energy), and geo-distributed network dynamics (latency and transmission costs). To bridge this gap, we present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling. DCcluster-Opt combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters with physics-informed models of data center operations, enabling rigorous and reproducible research in sustainable computing. It presents a challenging scheduling problem where a top-level coordinating agent must dynamically reassign or defer tasks that arrive with resource and service-level agreement requirements across a configurable cluster of data centers to optimize multiple objectives. The environment also models advanced components such as heat recovery. A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use. It provides a Gymnasium API with baseline controllers, including reinforcement learning and rule-based strategies, to support reproducible ML research and a fair comparison of diverse algorithms. By offering a realistic, configurable, and accessible testbed, DCcluster-Opt accelerates the development and validation of next-generation sustainable computing solutions for geo-distributed data centers.
Related papers
- Power Grid Control with Graph-Based Distributed Reinforcement Learning [60.49805771047161]
This work advances a graph-based distributed reinforcement learning framework for real-time, scalable grid management.<n>A Graph Neural Network (GNN) is employed to encode the network's topological information within the single low-level agent's observation.<n>Experiments on the Grid2Op simulation environment show the effectiveness of the approach.
arXiv Detail & Related papers (2025-09-02T22:17:25Z) - Benchmarking LLMs' Swarm intelligence [51.648605206159125]
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) remains largely unexplored.<n>We introduce SwarmBench, a novel benchmark designed to systematically evaluate tasks of LLMs acting as decentralized agents.<n>We propose metrics for coordination effectiveness and analyze emergent group dynamics.
arXiv Detail & Related papers (2025-05-07T12:32:01Z) - Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey [58.50944604905037]
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications.<n>Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems.<n>This survey provides a structured tutorial on fundamental architectures, enabling technologies, and emerging applications.
arXiv Detail & Related papers (2025-05-03T13:55:38Z) - Hierarchical Multi-Agent Framework for Carbon-Efficient Liquid-Cooled Data Center Clusters [5.335496791443277]
This paper introduces Green-DCC, which proposes a Reinforcement Learning (RL) based hierarchical controller to optimize both workload and liquid cooling dynamically in a DCC.<n>We demonstrate how the system optimize multiple data centers synchronously, enabling the scope of digital twins, and compare the performance of various RL approaches based on carbon emissions and sustainability metrics.
arXiv Detail & Related papers (2025-02-12T12:00:58Z) - SustainDC: Benchmarking for Sustainable Data Center Control [4.159959816797259]
We introduce SustainDC, a set of Python environments for benchmarking multi-agent reinforcement learning (MARL) algorithms for data centers (DC)<n>SustainDC supports custom DC configurations and tasks such as workload scheduling, cooling optimization, and auxiliary battery management.<n>We evaluate various MARL algorithms on SustainDC, showing their performance across diverse DC designs, locations, weather conditions, grid carbon intensity, and workload requirements.
arXiv Detail & Related papers (2024-08-14T22:43:52Z) - MOSAIC: A Multi-Objective Optimization Framework for Sustainable
Datacenter Management [2.9699290794642366]
We propose a novel framework for sustainable datacenter management.
We take into account multiple geography- and time-based factors including renewable energy sources, variable energy costs, power usage efficiency, carbon factors, and water intensity in energy.
Our framework achieves a cumulative improvement across all objectives (carbon, water, cost) of up to 4.61x compared to the state-of-the-arts.
arXiv Detail & Related papers (2023-11-14T23:05:37Z) - Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z) - Measuring the Carbon Intensity of AI in Cloud Instances [91.28501520271972]
We provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions.
We evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform.
arXiv Detail & Related papers (2022-06-10T17:04:04Z) - HUNTER: AI based Holistic Resource Management for Sustainable Cloud
Computing [26.48962351761643]
We propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER.
The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem.
Experiments on simulated and physical cloud environments show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violation, scheduling time, cost and temperature by up to 12, 35, 43, 54 and 3 percent respectively.
arXiv Detail & Related papers (2021-10-11T18:11:26Z) - A Framework for Energy and Carbon Footprint Analysis of Distributed and
Federated Edge Learning [48.63610479916003]
This article breaks down and analyzes the main factors that influence the environmental footprint of distributed learning policies.
It models both vanilla and decentralized FL policies driven by consensus.
Results show that FL allows remarkable end-to-end energy savings (30%-40%) for wireless systems characterized by low bit/Joule efficiency.
arXiv Detail & Related papers (2021-03-18T16:04:42Z) - Artificial Intelligence (AI)-Centric Management of Resources in Modern
Distributed Computing Systems [22.550075095184514]
Cloud Data Centres (DCS) are large scale, complex, heterogeneous, and distributed across multiple networks and geographical boundaries.
The Internet of Things (IoT)-driven applications are producing a huge amount of data that requires real-time processing and fast response.
Existing Resource Management Systems (RMS) rely on either static or solutions inadequate for such composite and dynamic systems.
arXiv Detail & Related papers (2020-06-09T06:54:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.