Dynamic GPU Energy Optimization for Machine Learning Training Workloads
- URL: http://arxiv.org/abs/2201.01684v1
- Date: Wed, 5 Jan 2022 16:25:48 GMT
- Title: Dynamic GPU Energy Optimization for Machine Learning Training Workloads
- Authors: Farui Wang, Weizhe Zhang, Shichao Lai, Meng Hao, Zheng Wang
- Abstract summary: GPOEO is an online GPU energy optimization framework for machine learning training workloads.
It employs novel techniques for online measurement, multi-objective prediction modeling, and search optimization.
Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.
- Score: 9.156075372403421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: GPUs are widely used to accelerate the training of machine learning
workloads. As modern machine learning models become increasingly larger, they
require a longer time to train, leading to higher GPU energy consumption. This
paper presents GPOEO, an online GPU energy optimization framework for machine
learning training workloads. GPOEO dynamically determines the optimal energy
configuration by employing novel techniques for online measurement,
multi-objective prediction modeling, and search optimization. To characterize
the target workload behavior, GPOEO utilizes GPU performance counters. To
reduce the performance counter profiling overhead, it uses an analytical model
to detect the training iteration change and only collects performance counter
data when an iteration shift is detected. GPOEO employs multi-objective models
based on gradient boosting and a local search algorithm to find a trade-off
between execution time and energy consumption. We evaluate GPOEO by
applying it to 71 machine learning workloads from two AI benchmark suites
running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling
strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average
execution time increase of 5.1%.
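The pipeline sketched in the abstract - learn per-configuration time and energy predictors from performance-counter features, then run a local search for a trade-off configuration - can be illustrated with a minimal, self-contained sketch. The feature layout, candidate SM frequencies, and the weighted-sum objective below are illustrative assumptions, not GPOEO's actual feature set, models, or search procedure.

```python
# Minimal sketch of a GPOEO-style predict-then-search step (illustrative only).
# Assumptions: performance-counter features for the current training iteration
# are already collected, candidate configurations are SM clock frequencies, and
# the time/energy trade-off is a simple weighted sum of the two predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Offline training data: [counter features, candidate frequency] -> time, energy.
X_train = np.random.rand(200, 6)    # 5 hypothetical counter features + frequency
t_train = np.random.rand(200)       # measured iteration time (s)
e_train = np.random.rand(200)       # measured iteration energy (J)

time_model = GradientBoostingRegressor().fit(X_train, t_train)
energy_model = GradientBoostingRegressor().fit(X_train, e_train)

def score(features, freq, alpha=0.5):
    """Weighted trade-off of predicted time and energy (lower is better)."""
    x = np.append(features, freq).reshape(1, -1)
    return alpha * time_model.predict(x)[0] + (1 - alpha) * energy_model.predict(x)[0]

def local_search(features, freqs, start_idx):
    """Hill-climb over neighbouring frequency settings until no improvement."""
    best = start_idx
    improved = True
    while improved:
        improved = False
        for cand in (best - 1, best + 1):
            if 0 <= cand < len(freqs) and score(features, freqs[cand]) < score(features, freqs[best]):
                best, improved = cand, True
    return freqs[best]

# Example: choose an SM frequency for the counters seen in the latest iteration.
candidate_freqs = np.linspace(0.6, 1.9, 14)   # GHz, hypothetical range
print(local_search(np.random.rand(5), candidate_freqs, start_idx=7))
```

In the real framework the counter collection, iteration-change detection, and frequency setting would go through the GPU vendor's management and profiling interfaces; those steps are omitted here.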
Related papers
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z) - Performance and Energy Consumption of Parallel Machine Learning Algorithms [0.0]
Machine learning models have achieved remarkable success in various real-world applications.
Model training in machine learning requires large-scale data sets and multiple iterations before it can work properly.
Parallelization of training algorithms is a common strategy to speed up the process of training.
arXiv Detail & Related papers (2023-05-01T13:04:39Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - AdaGrid: Adaptive Grid Search for Link Prediction Training Objective [58.79804082133998]
The training objective crucially influences the model's performance and generalization capabilities.
We propose Adaptive Grid Search (AdaGrid) which dynamically adjusts the edge message ratio during training.
We show that AdaGrid can boost the performance of the models by up to 1.9% while being nine times more time-efficient than a complete search.
arXiv Detail & Related papers (2022-03-30T09:24:17Z) - Building a Performance Model for Deep Learning Recommendation Model Training on GPUs [6.05245376098191]
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM).
We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time.
We propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph.
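As a rough illustration of the critical-path idea described above (and not the paper's actual model), the sketch below estimates a per-batch time as the longest path through a toy execution graph; the operator names and runtimes are hypothetical.

```python
# Toy critical-path estimate of per-batch time over an execution DAG.
# Graph structure, kernel names, and runtimes are invented for illustration.
from functools import lru_cache

# op -> (runtime in ms, list of ops it depends on)
graph = {
    "embedding_lookup": (1.2, []),
    "bottom_mlp":       (0.8, []),
    "interaction":      (0.5, ["embedding_lookup", "bottom_mlp"]),
    "top_mlp":          (1.0, ["interaction"]),
    "loss_backward":    (2.1, ["top_mlp"]),
}

@lru_cache(maxsize=None)
def finish_time(op):
    """Earliest completion time of an op: its runtime plus the latest dependency."""
    runtime, deps = graph[op]
    return runtime + max((finish_time(d) for d in deps), default=0.0)

predicted_batch_ms = max(finish_time(op) for op in graph)
print(f"critical-path batch time: {predicted_batch_ms:.1f} ms")
```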
arXiv Detail & Related papers (2022-01-19T19:05:42Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that the GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be commonly improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z) - Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters [10.395955671683245]
We propose ONES, an ONline Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.
arXiv Detail & Related papers (2021-08-08T14:20:05Z) - Large Batch Simulation for Deep Reinforcement Learning [101.01408262583378]
We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work.
We realize end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine.
By combining batch simulation and performance optimizations, we demonstrate that Point navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system.
arXiv Detail & Related papers (2021-03-12T00:22:50Z) - Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z) - How to Train Your Energy-Based Model for Regression [107.54411649704194]
Energy-based models (EBMs) have become increasingly popular within computer vision in recent years.
Recent work has applied EBMs also for regression tasks, achieving state-of-the-art performance on object detection and visual tracking.
How EBMs should be trained for best possible regression performance is not a well-studied problem.
arXiv Detail & Related papers (2020-05-04T17:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.