Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework
- URL: http://arxiv.org/abs/2509.26534v1
- Date: Tue, 30 Sep 2025 17:08:51 GMT
- Title: Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework
- Authors: Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Ricardo Bianchini,
- Abstract summary: We show how design choices in power, cooling, and networking provisioning impact long-term TCO. We also explore refresh strategies aligned with hardware trends. Our system reduces the TCO by up to 40% over traditional approaches.
- Score: 5.927989356089395
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapid rise of large language models (LLMs) has been driving an enormous demand for AI inference infrastructure, mainly powered by high-end GPUs. While these accelerators offer immense computational power, they incur high capital and operational costs due to frequent upgrades, dense power consumption, and cooling demands, making total cost of ownership (TCO) for AI datacenters a critical concern for cloud providers. Unfortunately, traditional datacenter lifecycle management (designed for general-purpose workloads) struggles to keep pace with AI's fast-evolving models, rising resource needs, and diverse hardware profiles. In this paper, we rethink the AI datacenter lifecycle scheme across three stages: building, hardware refresh, and operation. We show how design choices in power, cooling, and networking provisioning impact long-term TCO. We also explore refresh strategies aligned with hardware trends. Finally, we use operation software optimizations to reduce cost. While these optimizations at each stage yield benefits, unlocking the full potential requires rethinking the entire lifecycle. Thus, we present a holistic lifecycle management framework that coordinates and co-optimizes decisions across all three stages, accounting for workload dynamics, hardware evolution, and system aging. Our system reduces the TCO by up to 40% over traditional approaches. Using our framework, we provide guidelines on how to manage AI datacenter lifecycle for the future.
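The abstract's central trade-off (capex amortized over refresh cycles versus energy-driven opex) can be sketched with a back-of-the-envelope model. This is an illustration only, not the paper's actual TCO framework; every function name and number below is a hypothetical assumption.

```python
# Illustrative (not the paper's model): amortized annual TCO for an AI datacenter,
# combining capital costs spread over refresh/facility lifetimes with energy opex.

def annual_tco(gpu_capex, facility_capex, refresh_years, facility_years,
               power_kw, energy_price_per_kwh, pue, opex_other):
    """Annual TCO = amortized capex + energy opex + other opex.

    GPU capex is re-amortized each hardware refresh cycle (refresh_years), while
    facility capex (building, power, cooling) amortizes over facility_years.
    PUE scales IT power draw up to total facility power draw.
    """
    amortized_capex = gpu_capex / refresh_years + facility_capex / facility_years
    energy_opex = power_kw * pue * 24 * 365 * energy_price_per_kwh
    return amortized_capex + energy_opex + opex_other

# Hypothetical numbers: a faster refresh raises amortized capex but may buy more
# efficient GPUs and cooling (lower power, lower PUE) -- the trade-off the paper studies.
baseline = annual_tco(gpu_capex=100e6, facility_capex=200e6, refresh_years=5,
                      facility_years=15, power_kw=20_000,
                      energy_price_per_kwh=0.08, pue=1.3, opex_other=5e6)
fast_refresh = annual_tco(gpu_capex=100e6, facility_capex=200e6, refresh_years=3,
                          facility_years=15, power_kw=14_000,
                          energy_price_per_kwh=0.08, pue=1.15, opex_other=5e6)
print(f"baseline: ${baseline/1e6:.1f}M/yr, fast refresh: ${fast_refresh/1e6:.1f}M/yr")
```

With these made-up inputs the faster refresh comes out more expensive, showing that refresh strategy only pays off when efficiency gains outweigh the extra amortized capex.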
Related papers
- Toward Large-Scale Photonics-Empowered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration [5.036634263468385]
SimPhony provides implementation-aware modeling and rapid cross-layer evaluation. ADEPT and ADEPT-Z enable end-to-end circuit and topology exploration. Apollo and LiDAR provide scalable photonic physical design automation.
arXiv Detail & Related papers (2025-12-31T22:21:42Z)
- LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform [28.16213013287002]
We introduce LeJOT, an intelligent job cost orchestration framework for Databricks jobs. LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs. We show that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe.
arXiv Detail & Related papers (2025-12-20T08:09:58Z)
- DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads [14.834687262955585]
We present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling. It combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters. A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use.
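The modular reward described above can be pictured as a weighted combination of normalized per-objective penalties. This is a minimal sketch of the idea, not DCcluster-Opt's actual API; the function name, metric names, and weights are all hypothetical.

```python
# Sketch of a modular multi-objective reward (names and weights are hypothetical):
# each objective is a normalized penalty in [0, 1], combined by configurable
# weights so trade-offs can be studied explicitly by swapping weight profiles.

def modular_reward(metrics, weights):
    """Weighted negative sum of normalized penalties; higher reward is better."""
    return -sum(weights[name] * metrics[name] for name in weights)

# Example step metrics: carbon, energy cost, SLA violations, and water use,
# each pre-normalized against a per-step budget.
metrics = {"carbon": 0.4, "cost": 0.2, "sla": 0.0, "water": 0.1}

carbon_first = {"carbon": 0.7, "cost": 0.1, "sla": 0.1, "water": 0.1}
cost_first   = {"carbon": 0.1, "cost": 0.7, "sla": 0.1, "water": 0.1}

print(modular_reward(metrics, carbon_first))  # penalizes this carbon-heavy step more
print(modular_reward(metrics, cost_first))    # same step scores better under cost focus
```

Swapping weight profiles changes which scheduling policies an RL agent learns, which is what makes the trade-off study explicit.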
arXiv Detail & Related papers (2025-10-31T03:07:12Z)
- Improving AI Efficiency in Data Centres by Power Dynamic Response [74.12165648170894]
The steady growth of artificial intelligence (AI) has accelerated in recent years, facilitated by the development of sophisticated models. Ensuring robust and reliable power infrastructure is fundamental to taking advantage of the full potential of AI. However, AI data centres are extremely power-hungry, putting the problem of their power management in the spotlight.
arXiv Detail & Related papers (2025-10-13T08:08:21Z)
- AI Data Centers Need Pioneers to Deliver Scalable Power via Offgrid AI [0.0]
Our time demands a new revolution in scalable energy, mirroring in key ways the scalable computing revolution. The offgrid AI approach combines local, mostly renewable, generation and storage to power an AI data center, starting offgrid. I argue that the offgrid-AI approach needs pioneers among both system developers and AI-data-center operators to move it quickly from concept to large-scale deployment.
arXiv Detail & Related papers (2025-08-25T17:13:30Z)
- Intelligent Mobile AI-Generated Content Services via Interactive Prompt Engineering and Dynamic Service Provisioning [55.641299901038316]
Collaborative Mobile AIGC Service Providers (MASPs) at network edges can provide ubiquitous and customized AI-generated content for resource-constrained users. Such a paradigm faces two significant challenges: 1) raw prompts often lead to poor generation quality due to users' lack of experience with specific AIGC models, and 2) static service provisioning fails to efficiently utilize computational and communication resources. We develop an interactive prompt engineering mechanism that leverages a Large Language Model (LLM) to generate customized prompt corpora and employs Inverse Reinforcement Learning (IRL) for policy imitation.
arXiv Detail & Related papers (2025-02-17T03:05:20Z)
- Beyond Efficiency: Scaling AI Sustainably [4.711003829305544]
Modern AI applications have driven ever-increasing demands in computing.
This paper characterizes the carbon impact of AI, including both operational carbon emissions from training and inference as well as embodied carbon emissions from hardware manufacturing.
arXiv Detail & Related papers (2024-06-08T00:07:16Z)
- Game-Theoretic Deep Reinforcement Learning to Minimize Carbon Emissions and Energy Costs for AI Inference Workloads in Geo-Distributed Data Centers [3.3379026542599934]
This work introduces a unique approach combining Game Theory (GT) and Deep Reinforcement Learning (DRL) for optimizing the distribution of AI inference workloads in geo-distributed data centers.
The proposed technique integrates the principles of non-cooperative Game Theory into a DRL framework, enabling data centers to make intelligent decisions regarding workload allocation.
arXiv Detail & Related papers (2024-04-01T20:13:28Z)
- Green Edge AI: A Contemporary Survey [46.11332733210337]
The transformative power of AI is derived from the utilization of deep neural networks (DNNs). Deep learning (DL) is increasingly being transitioned to wireless edge networks in proximity to end-user devices (EUDs).
Despite its potential, edge AI faces substantial challenges, mostly due to the dichotomy between the resource limitations of wireless edge networks and the resource-intensive nature of DL.
arXiv Detail & Related papers (2023-12-01T04:04:37Z)
- Power Hungry Processing: Watts Driving the Cost of AI Deployment? [74.19749699665216]
Generative, multi-purpose AI systems promise a unified approach to building machine learning (ML) models into technology.
This ambition of "generality" comes at a steep cost to the environment, given the amount of energy these systems require and the amount of carbon that they emit.
We measure deployment cost as the amount of energy and carbon required to perform 1,000 inferences on a representative benchmark dataset using these models.
We conclude with a discussion of the current trend of deploying multi-purpose generative ML systems, and caution that their utility should be more intentionally weighed against increased costs in terms of energy and emissions.
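The deployment-cost metric above is simple arithmetic: per-inference energy scaled to 1,000 inferences, then converted to carbon via grid intensity. A back-of-the-envelope sketch, where the per-inference energy figure and grid intensity are hypothetical examples, not measurements from the paper:

```python
# Back-of-the-envelope version of the per-1,000-inferences deployment-cost metric.
# The inputs below are hypothetical, not the paper's measured values.

def deployment_cost(joules_per_inference, carbon_g_per_kwh, n_inferences=1000):
    """Return (kWh, grams CO2e) for n_inferences at the given per-inference energy."""
    kwh = joules_per_inference * n_inferences / 3_600_000  # 1 kWh = 3.6e6 J
    return kwh, kwh * carbon_g_per_kwh

# E.g. a generative model drawing ~3,000 J per inference on a 400 gCO2e/kWh grid:
kwh, grams = deployment_cost(joules_per_inference=3000, carbon_g_per_kwh=400)
print(f"{kwh:.3f} kWh, {grams:.1f} g CO2e per 1,000 inferences")
```

The same per-inference figure multiplied across billions of daily queries is what makes the aggregate deployment footprint worth weighing against the utility of generality.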
arXiv Detail & Related papers (2023-11-28T15:09:36Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- The Future of Consumer Edge-AI Computing [58.445652425379855]
Deep Learning has rapidly infiltrated the consumer end, mainly thanks to hardware acceleration across devices.
As we look towards the future, it is evident that isolated hardware will be insufficient.
We introduce a novel paradigm centered around EdgeAI-Hub devices, designed to reorganise and optimise compute resources and data access at the consumer edge.
arXiv Detail & Related papers (2022-10-19T12:41:47Z)
- Innovations in the field of on-board scheduling technologies [64.41511459132334]
This paper proposes an onboard scheduler that integrates into an onboard software framework for mission autonomy.
The scheduler is based on linear integer programming and relies on the use of a branch-and-cut solver.
The technology has been tested on an Earth Observation scenario, comparing its performance against the state-of-the-art scheduling technology.
arXiv Detail & Related papers (2022-05-04T12:00:49Z)
- HUNTER: AI based Holistic Resource Management for Sustainable Cloud Computing [26.48962351761643]
We propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER.
The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem.
Experiments on simulated and physical cloud environments show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violations, scheduling time, cost, and temperature by up to 12, 35, 43, 54, and 3 percent, respectively.
arXiv Detail & Related papers (2021-10-11T18:11:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.