A Reinforcement Learning Approach for Performance-aware Reduction in
Power Consumption of Data Center Compute Nodes
- URL: http://arxiv.org/abs/2308.08069v1
- Date: Tue, 15 Aug 2023 23:25:52 GMT
- Authors: Akhilesh Raj, Swann Perarnau, Aniruddha Gokhale
- Abstract summary: We use Reinforcement Learning to design a power capping policy on cloud compute nodes.
We show how a trained agent running on actual hardware can take actions by balancing power consumption and application performance.
- Score: 0.46040036610482665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Exascale computing becomes a reality, the energy needs of compute nodes in
cloud data centers will continue to grow. A common approach to reducing this
energy demand is to limit the power consumption of hardware components when
workloads are experiencing bottlenecks elsewhere in the system. However,
designing a resource controller capable of detecting and limiting power
consumption on-the-fly is a complex issue and can also adversely impact
application performance. In this paper, we explore the use of Reinforcement
Learning (RL) to design a power capping policy on cloud compute nodes using
observations on current power consumption and instantaneous application
performance (heartbeats). By leveraging the Argo Node Resource Management (NRM)
software stack in conjunction with the Intel Running Average Power Limit (RAPL)
hardware control mechanism, we design an agent to control the maximum supplied
power to processors without compromising on application performance. Employing
a Proximal Policy Optimization (PPO) agent to learn an optimal policy on a
mathematical model of the compute nodes, we demonstrate and evaluate using the
STREAM benchmark how a trained agent running on actual hardware can take
actions by balancing power consumption and application performance.
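The trade-off described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `ToyNode` model, its knee at 80 W, the reward weights, and the exhaustive search over candidate caps are hypothetical and are not taken from the paper, which trains a PPO agent on its own mathematical model and applies caps through NRM and RAPL. The sketch only shows why a memory-bound workload admits a power cap that saves power without losing heartbeat rate.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical memory-bound node: progress saturates above a knee power."""
    knee_watts: float = 80.0  # assumed cap above which extra power no longer helps
    max_rate: float = 100.0   # assumed heartbeat rate (beats/s) at full speed

    def step(self, cap_watts: float) -> tuple[float, float]:
        """Return (power drawn, heartbeat rate) under the given power cap."""
        power = cap_watts  # assumption: the node draws right up to its cap
        rate = self.max_rate * min(1.0, cap_watts / self.knee_watts)
        return power, rate


def reward(power, rate, node, max_watts, perf_weight=2.0):
    """Penalise normalised power draw and, more heavily, lost performance."""
    power_cost = power / max_watts
    perf_loss = 1.0 - rate / node.max_rate
    return -power_cost - perf_weight * perf_loss


def best_cap(node, caps, max_watts=120.0):
    """Stand-in for a learned policy: pick the cap with the highest reward."""
    return max(caps, key=lambda c: reward(*node.step(c), node, max_watts))


if __name__ == "__main__":
    node = ToyNode()
    print(best_cap(node, [40, 60, 80, 100, 120]))
```

Under these assumed numbers the search settles on the knee of the curve: lower caps lose heartbeat rate, higher caps spend power with no gain. An RL agent would have to discover that point from observations instead of being given the model.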
Related papers
- WattScope: Non-intrusive Application-level Power Disaggregation in
Datacenters [0.6086160084025234]
WattScope is a system for non-intrusively estimating the power consumption of individual applications.
WattScope adapts and extends a machine learning-based technique for disaggregating building power.
arXiv Detail & Related papers (2023-09-22T04:13:46Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A
Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- Distributed-Training-and-Execution Multi-Agent Reinforcement Learning
for Power Control in HetNet [48.96004919910818]
We propose a multi-agent deep reinforcement learning (MADRL) based power control scheme for the HetNet.
To promote cooperation among agents, we develop a penalty-based Q learning (PQL) algorithm for MADRL systems.
In this way, an agent's policy can be learned by other agents more easily, resulting in a more efficient collaboration process.
arXiv Detail & Related papers (2022-12-15T17:01:56Z)
- Precise Energy Consumption Measurements of Heterogeneous Artificial
Intelligence Workloads [0.534434568021034]
We present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes.
One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer.
arXiv Detail & Related papers (2022-12-03T21:40:55Z)
- Distributed Energy Management and Demand Response in Smart Grids: A
Multi-Agent Deep Reinforcement Learning Framework [53.97223237572147]
This paper presents a multi-agent Deep Reinforcement Learning (DRL) framework for autonomous control and integration of renewable energy resources into smart power grid systems.
In particular, the proposed framework jointly considers demand response (DR) and distributed energy management (DEM) for residential end-users.
arXiv Detail & Related papers (2022-11-29T01:18:58Z)
- Deep Reinforcement Learning Based Multidimensional Resource Management
for Energy Harvesting Cognitive NOMA Communications [64.1076645382049]
Combination of energy harvesting (EH), cognitive radio (CR), and non-orthogonal multiple access (NOMA) is a promising solution to improve energy efficiency.
In this paper, we study the spectrum, energy, and time resource management for deterministic-CR-NOMA IoT systems.
arXiv Detail & Related papers (2021-09-17T08:55:48Z)
- Power Modeling for Effective Datacenter Planning and Compute Management [53.41102502425513]
We discuss two classes of statistical power models designed and validated to be accurate, simple, interpretable and applicable to all hardware configurations and workloads.
We demonstrate that the proposed statistical modeling techniques, while simple and scalable, predict power with less than 5% Mean Absolute Percent Error (MAPE) for more than 95% of diverse Power Distribution Units (more than 2000) using only 4 features.
arXiv Detail & Related papers (2021-03-22T21:22:51Z)
- Intelligent colocation of HPC workloads [0.0]
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized.
It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique is to colocate multiple applications on the same server.
We show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters.
arXiv Detail & Related papers (2021-03-16T12:35:35Z)
- Edge Intelligence for Energy-efficient Computation Offloading and
Resource Allocation in 5G Beyond [7.953533529450216]
5G beyond is an end-edge-cloud orchestrated network that can exploit heterogeneous capabilities of the end devices, edge servers, and the cloud.
In multi-user wireless networks, diverse application requirements and the possibility of various radio access modes for communication among devices make it challenging to design an optimal computation offloading scheme.
Deep Reinforcement Learning (DRL) is an emerging technique to address such an issue with limited and less accurate network information.
arXiv Detail & Related papers (2020-11-17T05:51:03Z)
- Reinforcement Learning on Computational Resource Allocation of
Cloud-based Wireless Networks [22.06811314358283]
Wireless networks used for Internet of Things (IoT) are expected to largely involve cloud-based computing and processing.
In a cloud environment, dynamic computational resource allocation is essential to save energy while maintaining the performance of the processes.
This paper models this dynamic computational resource allocation problem as a Markov Decision Process (MDP) and designs a model-based reinforcement-learning agent to optimise the dynamic resource allocation of the CPU usage.
The results show that our agent rapidly converges to the optimal policy, performs stably in different settings, and matches or outperforms a baseline algorithm in energy savings across different scenarios.
arXiv Detail & Related papers (2020-10-10T15:16:26Z)
- Risk-Aware Energy Scheduling for Edge Computing with Microgrid: A
Multi-Agent Deep Reinforcement Learning Approach [82.6692222294594]
We study a risk-aware energy scheduling problem for a microgrid-powered MEC network.
We derive the solution by applying a multi-agent deep reinforcement learning (MADRL)-based asynchronous advantage actor-critic (A3C) algorithm with shared neural networks.
arXiv Detail & Related papers (2020-02-21T02:14:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.