Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale
- URL: http://arxiv.org/abs/2402.18593v1
- Date: Sun, 25 Feb 2024 02:22:34 GMT
- Title: Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale
- Authors: Dan Zhao, Siddharth Samsi, Joseph McDonald, Baolin Li, David Bestor,
Michael Jones, Devesh Tiwari, Vijay Gadepally
- Abstract summary: Recent large language models require considerable resources to train and deploy.
With the right amount of power-capping, we show significant decreases in both temperature and power draw.
Our work is the first to conduct and make available a detailed analysis of the effects of GPU power-capping at the supercomputing scale.
- Score: 20.30679358575365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As research and deployment of AI grows, the computational burden to support
and sustain its progress inevitably does too. To train or fine-tune
state-of-the-art models in NLP, computer vision, etc., some form of AI hardware
acceleration is virtually a requirement. Recent large language models require
considerable resources to train and deploy, resulting in significant energy
usage, potential carbon emissions, and massive demand for GPUs and other
hardware accelerators. However, this surge carries large implications for
energy sustainability at the HPC/datacenter level. In this paper, we study the
aggregate effect of power-capping GPUs on GPU temperature and power draw at a
research supercomputing center. With the right amount of power-capping, we show
significant decreases in both temperature and power draw, reducing power
consumption and potentially improving hardware life-span with minimal impact on
job performance. While power-capping reduces power draw by design, the
aggregate system-wide effect on overall energy consumption is less clear; for
instance, if users notice job performance degradation from GPU power-caps, they
may request additional GPU-jobs to compensate, negating any energy savings or
even worsening energy consumption. To our knowledge, our work is the first to
conduct and make available a detailed analysis of the effects of GPU
power-capping at the supercomputing scale. We hope our work will inspire
HPCs/datacenters to further explore, evaluate, and communicate the impact of
power-capping AI hardware accelerators for more sustainable AI.
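GPU power caps of the kind studied here are commonly applied through NVIDIA's management interface (NVML, also exposed via nvidia-smi). The sketch below, using the pynvml bindings, shows how such a cap could be applied per device and how the two quantities the paper aggregates, power draw and temperature, could then be sampled. The 250 W value, the device loop, and the sampling cadence are illustrative assumptions, not the paper's experimental configuration.

```python
# Hedged sketch: apply a GPU power cap and sample power/temperature via NVML.
# Assumes NVIDIA GPUs, the pynvml package, and root privileges for setting limits.
# The 250 W cap is an illustrative value, not the paper's setting.
import time
import pynvml

CAP_WATTS = 250  # assumed cap for illustration

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)

        # NVML reports limits in milliwatts; clamp the cap to the supported range.
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        cap_mw = max(min_mw, min(CAP_WATTS * 1000, max_mw))
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_mw)  # requires root

        # Sample the quantities analyzed in the paper: power draw and temperature.
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {i}: cap={cap_mw / 1000:.0f} W, draw={power_w:.1f} W, temp={temp_c} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

Operationally, the same per-device limit can be set from the command line with `nvidia-smi -pl <watts>`, which a center could script across nodes (e.g., in a scheduler prologue); whether the authors used exactly this mechanism is not stated in the abstract.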
Related papers
- Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach [15.28157695259566]
Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures.
This paper studies a novel and practical online energy optimization problem for GPUs in HPC scenarios.
EnergyUCB is designed to dynamically adjust GPU core frequencies in real time, reducing energy consumption with minimal impact on performance (a minimal bandit sketch in this spirit appears after this list).
arXiv Detail & Related papers (2024-10-03T17:05:34Z)
- On the Opportunities of Green Computing: A Survey [80.21955522431168]
Artificial Intelligence (AI) has achieved significant advances in technology and research over several decades of development.
The need for high computing power brings higher carbon emissions and undermines research fairness.
To tackle the challenges of computing resources and environmental impact of AI, Green Computing has become a hot research topic.
arXiv Detail & Related papers (2023-11-01T11:16:41Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast, untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability and heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Energy Concerns with HPC Systems and Applications [0.0]
Energy has become a critical concern in all relevant activities and technical designs.
For the specific case of computer activities, the problem is exacerbated with the emergence and pervasiveness of the so-called intelligent devices.
There are mainly two contexts where energy is one of the top priority concerns: embedded computing and supercomputing.
arXiv Detail & Related papers (2023-08-31T08:33:42Z)
- Non-Intrusive Electric Load Monitoring Approach Based on Current Feature Visualization for Smart Energy Management [51.89904044860731]
We employ AI computer vision techniques to design a non-invasive load monitoring method for smart electric energy management.
We propose to recognize all electric loads from color feature images using a U-shaped deep neural network with multi-scale feature extraction and an attention mechanism.
arXiv Detail & Related papers (2023-08-08T04:52:19Z)
- Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads [0.534434568021034]
We present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes.
One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer.
arXiv Detail & Related papers (2022-12-03T21:40:55Z)
- Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models [8.927248087602942]
We investigate techniques that can be used to reduce the energy consumption of common NLP applications.
These techniques can lead to significant reductions in energy consumption when training language models or using them for inference.
arXiv Detail & Related papers (2022-05-19T16:03:55Z)
- The Ecological Footprint of Neural Machine Translation Systems [2.132096006921048]
This chapter focuses on the ecological footprint of neural MT systems.
It starts from the power drain during the training of and inference with neural MT models and moves towards the environmental impact.
The overall CO2 offload is calculated for Ireland and the Netherlands.
arXiv Detail & Related papers (2022-02-04T14:56:41Z)
- Compute and Energy Consumption Trends in Deep Learning Inference [67.32875669386488]
We study relevant models in the areas of computer vision and natural language processing.
For a sustained increase in performance we see a much softer growth in energy consumption than previously anticipated.
arXiv Detail & Related papers (2021-09-12T09:40:18Z)
- JUWELS Booster -- A Supercomputer for Large-Scale AI Research [79.02246047353273]
We present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center.
We detail its system architecture, parallel and distributed model training, and benchmarks indicating its outstanding performance.
arXiv Detail & Related papers (2021-06-30T21:37:02Z)
- The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems [45.479582612113205]
We show how to improve the performance and power efficiency of RL training on CPU-GPU systems.
We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework.
We also introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources.
arXiv Detail & Related papers (2020-12-08T04:50:05Z)
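Since the EnergyUCB entry above frames GPU frequency selection as a multi-armed bandit, the sketch below illustrates the general idea with a standard UCB1 loop over a small set of candidate core clocks. The frequency list, reward model, and measurement stub are stand-in assumptions for illustration, not the paper's algorithm or data.

```python
# Hedged sketch: UCB1 bandit over candidate GPU core frequencies.
# The frequencies and the simulated reward are illustrative assumptions only.
import math
import random

FREQS_MHZ = [900, 1100, 1300, 1500]  # assumed candidate core clocks

def measure_reward(freq_mhz: int) -> float:
    """Stand-in for a real measurement such as throughput per joule at the chosen
    clock; here we simulate a noisy response that peaks at a mid-range frequency."""
    base = 1.0 - abs(freq_mhz - 1200) / 1000.0
    return base + random.gauss(0.0, 0.05)

counts = [0] * len(FREQS_MHZ)    # pulls per arm (frequency)
values = [0.0] * len(FREQS_MHZ)  # running mean reward per arm

for t in range(1, 201):
    if 0 in counts:
        arm = counts.index(0)    # play every arm once before using confidence bounds
    else:
        # UCB1: mean reward plus an exploration bonus that shrinks as an arm is pulled more
        ucb = [values[i] + math.sqrt(2.0 * math.log(t) / counts[i])
               for i in range(len(FREQS_MHZ))]
        arm = ucb.index(max(ucb))

    reward = measure_reward(FREQS_MHZ[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

best = FREQS_MHZ[values.index(max(values))]
print(f"frequency with best estimated reward: {best} MHz")
```

In a real deployment the reward would come from measured energy and work completed (e.g., via NVML counters), and the selected frequency would be applied through the corresponding clock-control interface rather than simulated.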