Scheduling Optimization Techniques for Neural Network Training
- URL: http://arxiv.org/abs/2110.00929v1
- Date: Sun, 3 Oct 2021 05:45:06 GMT
- Title: Scheduling Optimization Techniques for Neural Network Training
- Authors: Hyungjun Oh, Junyeol Lee, HyeongJu Kim, Jiwon Seo
- Abstract summary: This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that the GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be commonly improved by applying ooo backprop.
- Score: 3.1617796705744547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network training requires a large amount of computation and thus GPUs
are often used for the acceleration. While they improve the performance, GPUs
are underutilized during the training. This paper proposes out-of-order (ooo)
backprop, an effective scheduling technique for neural network training. By
exploiting the dependencies of gradient computations, ooo backprop enables
reordering their execution to make the most of the GPU resources. We show that
the GPU utilization in single-GPU, data-parallel, and pipeline-parallel
training can be commonly improved by applying ooo backprop and prioritizing
critical operations. We propose three scheduling algorithms based on ooo
backprop. For single-GPU training, we schedule with multi-stream out-of-order
computation to mask the kernel launch overhead. In data-parallel training, we
reorder the gradient computations to maximize the overlapping of computation
and parameter communication; in pipeline-parallel training, we prioritize
critical gradient computations to reduce the pipeline stalls. We evaluate our
optimizations with twelve neural networks including a light-weight computer
vision model (MobileNet) and large NLP models (BERT and GPT-3) with up to
forty-eight V100 GPUs. Our scheduling algorithms effectively improve the performance
of single-GPU training as well as data- and pipeline-parallel training. Compared
to the respective state-of-the-art training systems, the throughput is
substantially improved for single-GPU, data-parallel, and pipeline-parallel
training.
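The reordering idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation; it assumes the usual decomposition of a layer's backward step into an input-gradient computation (needed by the next layer in the backward pass) and a weight-gradient computation (needed only by the optimizer). The scalar "layers" and the function names `backward_in_order` and `backward_out_of_order` are hypothetical, chosen only to show that deferring the weight gradients does not change the result.

```python
def forward(x, weights):
    """Run a chain of scalar layers y = w * x and record each layer's input."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = w * x
    return x, inputs


def backward_in_order(grad_out, weights, inputs):
    """Conventional order: each layer computes its weight gradient, then its input gradient."""
    weight_grads = [None] * len(weights)
    g = grad_out
    for i in reversed(range(len(weights))):
        weight_grads[i] = inputs[i] * g  # weight gradient of layer i
        g = g * weights[i]               # input gradient, passed to layer i - 1
    return weight_grads


def backward_out_of_order(grad_out, weights, inputs):
    """Reordered: finish the input-gradient chain first, then the deferred weight gradients."""
    layer_grads = [None] * len(weights)
    g = grad_out
    for i in reversed(range(len(weights))):
        layer_grads[i] = g               # gradient with respect to layer i's output
        g = g * weights[i]               # only this chain is sequentially dependent
    # Weight gradients depend only on saved values, so any order is valid.
    # Here they are computed first layer first (an arbitrary choice for illustration).
    weight_grads = [None] * len(weights)
    for i in range(len(weights)):
        weight_grads[i] = inputs[i] * layer_grads[i]
    return weight_grads


weights = [2.0, -3.0, 0.5]
output, inputs = forward(1.5, weights)
in_order = backward_in_order(1.0, weights, inputs)
reordered = backward_out_of_order(1.0, weights, inputs)
print(in_order == reordered)
```

Because each weight gradient reads only values that are already stored, a scheduler is free to move those computations, for example to overlap them with communication or to fill idle GPU time. How the paper's three algorithms choose the order is not shown here.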
Related papers
- Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning [8.628231789161577]
We present PPLL (Pipeline Parallelism based on Local Learning), a novel framework that leverages local learning algorithms to enable effective parallel training across multiple GPUs.
By utilizing queues to manage data transfers between GPUs, PPLL ensures seamless cross-GPU communication, allowing multiple blocks to execute forward and backward passes in a pipelined manner.
Our results demonstrate that PPLL significantly enhances the training speed of the local learning method while achieving comparable or even superior training speed to traditional pipeline parallelism.
arXiv Detail & Related papers (2024-11-19T08:09:18Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
- Accelerating GAN training using highly parallel hardware on public cloud [0.3694429692322631]
This work explores different types of cloud services to train a Generative Adversarial Network (GAN) in a parallel environment.
We parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPUs).
Linear speed-up of the training process is obtained, while retaining most of the performance in terms of physics results.
arXiv Detail & Related papers (2021-11-08T16:59:15Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- Large Batch Simulation for Deep Reinforcement Learning [101.01408262583378]
We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work.
We realize end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine.
By combining batch simulation and performance optimizations, we demonstrate that Point navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system.
arXiv Detail & Related papers (2021-03-12T00:22:50Z)
- Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach [1.5857983167543392]
We present a new practical technique to help users make informed and cost-efficient GPU selections.
We make predictions by scaling the execution time of each operation in a training iteration from one GPU to another using either (i) wave scaling, a technique based on a GPU's execution model, or (ii) pre-trained multilayer perceptrons.
We implement our technique into a Python library called Surfer and find that it makes accurate iteration execution time predictions on ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN.
arXiv Detail & Related papers (2021-01-31T20:17:46Z)
- Accurate, Efficient and Scalable Training of Graph Neural Networks [9.569918335816963]
Graph Neural Networks (GNNs) are powerful deep learning models to generate node embeddings on graphs.
It is still challenging to perform training in an efficient and scalable way.
We propose a novel parallel training framework that reduces training workload by orders of magnitude compared with state-of-the-art minibatch methods.
arXiv Detail & Related papers (2020-10-05T22:06:23Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
- GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms [1.2183405753834562]
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs.
It is challenging to accelerate training of GCNs due to substantial and irregular data communication.
We design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems.
arXiv Detail & Related papers (2019-12-31T21:19:01Z)