Dynamic GPU Energy Optimization for Machine Learning Training Workloads
- URL: http://arxiv.org/abs/2201.01684v1
- Date: Wed, 5 Jan 2022 16:25:48 GMT
- Title: Dynamic GPU Energy Optimization for Machine Learning Training Workloads
- Authors: Farui Wang, Weizhe Zhang, Shichao Lai, Meng Hao, Zheng Wang
- Abstract summary: GPOEO is an online GPU energy optimization framework for machine learning training workloads.
It employs novel techniques for online measurement, multi-objective prediction modeling, and search optimization.
Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.
- Score: 9.156075372403421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: GPUs are widely used to accelerate the training of machine learning
workloads. As modern machine learning models become increasingly larger, they
require a longer time to train, leading to higher GPU energy consumption. This
paper presents GPOEO, an online GPU energy optimization framework for machine
learning training workloads. GPOEO dynamically determines the optimal energy
configuration by employing novel techniques for online measurement,
multi-objective prediction modeling, and search optimization. To characterize
the target workload behavior, GPOEO utilizes GPU performance counters. To
reduce the performance counter profiling overhead, it uses an analytical model
to detect the training iteration change and only collects performance counter
data when an iteration shift is detected. GPOEO employs multi-objective models
based on gradient boosting and a local search algorithm to find a trade-off
between execution time and energy consumption. We evaluate GPOEO by
applying it to 71 machine learning workloads from two AI benchmark suites
running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling
strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average
execution time increase of 5.1%.
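The pipeline sketched in the abstract - learn per-configuration time and energy predictors from performance-counter features, then run a local search for a trade-off configuration - can be illustrated with a minimal, self-contained sketch. The feature layout, candidate SM frequencies, and the weighted-sum objective below are illustrative assumptions, not GPOEO's actual feature set, models, or search procedure.

```python
# Minimal sketch of a GPOEO-style predict-then-search step (illustrative only).
# Assumptions: performance-counter features for the current training iteration
# are already collected, candidate configurations are SM clock frequencies, and
# the time/energy trade-off is a simple weighted sum of the two predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Offline training data: [counter features, candidate frequency] -> time, energy.
X_train = np.random.rand(200, 6)    # 5 hypothetical counter features + frequency
t_train = np.random.rand(200)       # measured iteration time (s)
e_train = np.random.rand(200)       # measured iteration energy (J)

time_model = GradientBoostingRegressor().fit(X_train, t_train)
energy_model = GradientBoostingRegressor().fit(X_train, e_train)

def score(features, freq, alpha=0.5):
    """Weighted trade-off of predicted time and energy (lower is better)."""
    x = np.append(features, freq).reshape(1, -1)
    return alpha * time_model.predict(x)[0] + (1 - alpha) * energy_model.predict(x)[0]

def local_search(features, freqs, start_idx):
    """Hill-climb over neighbouring frequency settings until no improvement."""
    best = start_idx
    improved = True
    while improved:
        improved = False
        for cand in (best - 1, best + 1):
            if 0 <= cand < len(freqs) and score(features, freqs[cand]) < score(features, freqs[best]):
                best, improved = cand, True
    return freqs[best]

# Example: choose an SM frequency for the counters seen in the latest iteration.
candidate_freqs = np.linspace(0.6, 1.9, 14)   # GHz, hypothetical range
print(local_search(np.random.rand(5), candidate_freqs, start_idx=7))
```

In the real framework the counter collection, iteration-change detection, and frequency setting would go through the GPU vendor's management and profiling interfaces; those steps are omitted here.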
Related papers
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z) - Performance and Energy Consumption of Parallel Machine Learning Algorithms [0.0]
Machine learning models have achieved remarkable success in various real-world applications.
Model training in machine learning requires large-scale data sets and multiple iterations before it can work properly.
Parallelization of training algorithms is a common strategy to speed up the process of training.
arXiv Detail & Related papers (2023-05-01T13:04:39Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - AdaGrid: Adaptive Grid Search for Link Prediction Training Objective [58.79804082133998]
The training objective crucially influences the model's performance and generalization capabilities.
We propose Adaptive Grid Search (AdaGrid) which dynamically adjusts the edge message ratio during training.
We show that AdaGrid can boost the performance of the models by up to 1.9% while being nine times more time-efficient than a complete search.
arXiv Detail & Related papers (2022-03-30T09:24:17Z) - Building a Performance Model for Deep Learning Recommendation Model Training on GPUs [6.05245376098191]
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM).
We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time.
We propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph.
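As a rough illustration of the critical-path idea described above (and not the paper's actual model), the sketch below estimates a per-batch time as the longest path through a toy execution graph; the operator names and runtimes are hypothetical.

```python
# Toy critical-path estimate of per-batch time over an execution DAG.
# Graph structure, kernel names, and runtimes are invented for illustration.
from functools import lru_cache

# op -> (runtime in ms, list of ops it depends on)
graph = {
    "embedding_lookup": (1.2, []),
    "bottom_mlp":       (0.8, []),
    "interaction":      (0.5, ["embedding_lookup", "bottom_mlp"]),
    "top_mlp":          (1.0, ["interaction"]),
    "loss_backward":    (2.1, ["top_mlp"]),
}

@lru_cache(maxsize=None)
def finish_time(op):
    """Earliest completion time of an op: its runtime plus the latest dependency."""
    runtime, deps = graph[op]
    return runtime + max((finish_time(d) for d in deps), default=0.0)

predicted_batch_ms = max(finish_time(op) for op in graph)
print(f"critical-path batch time: {predicted_batch_ms:.1f} ms")
```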
arXiv Detail & Related papers (2022-01-19T19:05:42Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that the GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be commonly improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z) - Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters [10.395955671683245]
We propose ONES, an ONline Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.
arXiv Detail & Related papers (2021-08-08T14:20:05Z) - Large Batch Simulation for Deep Reinforcement Learning [101.01408262583378]
We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work.
We realize end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine.
By combining batch simulation and performance optimizations, we demonstrate that Point navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system.
arXiv Detail & Related papers (2021-03-12T00:22:50Z) - Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z) - How to Train Your Energy-Based Model for Regression [107.54411649704194]
Energy-based models (EBMs) have become increasingly popular within computer vision in recent years.
Recent work has applied EBMs also for regression tasks, achieving state-of-the-art performance on object detection and visual tracking.
How EBMs should be trained for best possible regression performance is not a well-studied problem.
arXiv Detail & Related papers (2020-05-04T17:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.