Optimizing Data Collection in Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2207.07736v1
- Date: Fri, 15 Jul 2022 20:22:31 GMT
- Title: Optimizing Data Collection in Deep Reinforcement Learning
- Authors: James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara,
Gennady Pekhimenko
- Abstract summary: GPU vectorization can achieve up to $1024\times$ speedup over commonly used CPU simulators.
We show that simulator kernel fusion speedups with a simple simulator are $11.3\times$ and increase by up to $1024\times$ as simulator complexity increases in terms of memory bandwidth requirements.
- Score: 4.9709347068704455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) workloads take a notoriously long time to train
due to the large number of samples collected at run-time from simulators.
Unfortunately, cluster scale-up approaches remain expensive, and commonly used
CPU implementations of simulators induce high overhead when switching back and
forth between GPU computations. We explore two optimizations that increase RL
data collection efficiency by increasing GPU utilization: (1) GPU
vectorization: parallelizing simulation on the GPU for increased hardware
parallelism, and (2) simulator kernel fusion: fusing multiple simulation steps
to run in a single GPU kernel launch to reduce global memory bandwidth
requirements. We find that GPU vectorization can achieve up to $1024\times$
speedup over commonly used CPU simulators. We profile the performance of
different implementations and show that for a simple simulator, ML compiler
implementations (XLA) of GPU vectorization outperform a DNN framework (PyTorch)
by $13.4\times$ by reducing CPU overhead from repeated Python to DL backend API
calls. We show that simulator kernel fusion speedups with a simple simulator
are $11.3\times$ and increase by up to $1024\times$ as simulator complexity
increases in terms of memory bandwidth requirements. We show that the speedups
from simulator kernel fusion are orthogonal and combinable with GPU
vectorization, leading to a multiplicative speedup.
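The two optimizations can be made concrete with a short sketch. The following is not the paper's code; it is a minimal JAX sketch (JAX lowers to XLA, one of the backends the paper profiles) built on a hypothetical toy simulator: `jax.vmap` provides GPU vectorization across environments, and `jit`-compiling a `lax.scan` over many steps lets XLA fuse per-step work, which is the effect simulator kernel fusion targets.

```python
from functools import partial

import jax
import jax.numpy as jnp

def step(state):
    # One explicit-Euler update of a toy damped oscillator, standing in
    # for a real simulator's per-step physics (assumed dynamics).
    pos, vel = state
    dt = 0.01
    acc = -pos - 0.1 * vel
    return (pos + dt * vel, vel + dt * acc)

# (1) GPU vectorization: map the single-environment step over a batch of
# environments so one kernel launch advances all of them.
vectorized_step = jax.vmap(step)

# (2) Simulator kernel fusion: jit-compiling a scan over num_steps steps
# lets XLA fuse the per-step element-wise work, replacing per-step kernel
# launches and Python dispatch with a few launches per rollout.
@partial(jax.jit, static_argnames="num_steps")
def rollout(state, num_steps=64):
    def body(carry, _):
        return vectorized_step(carry), None
    final, _ = jax.lax.scan(body, state, None, length=num_steps)
    return final

num_envs = 4096  # illustrative batch size, not a figure from the paper
init = (jnp.zeros(num_envs), jnp.ones(num_envs))
final_state = rollout(init)  # 64 fused simulation steps for 4096 envs
```

Because vectorization widens each kernel launch while fusion removes launches and global-memory round trips, the two speedups combine multiplicatively, as the abstract reports.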
Related papers
- Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU [15.337470862838794]
We present WarpDrive, a flexible, lightweight, and easy-to-use open-source RL framework that implements end-to-end multi-agent RL on a single GPU.
Our design runs simulations and the agents in each simulation in parallel. It also uses a single simulation data store on the GPU that is safely updated in-place.
WarpDrive yields 2.9 million environment steps/second with 2000 environments and 1000 agents (at least 100x higher throughput compared to a CPU implementation) in a benchmark Tag simulation.
arXiv Detail & Related papers (2021-08-31T16:59:27Z)
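The design the WarpDrive summary describes can be sketched generically. The following is NOT WarpDrive's API; it is a minimal JAX illustration of the same pattern, with a hypothetical per-agent update rule standing in for the combined policy and environment logic:

```python
from functools import partial

import jax
import jax.numpy as jnp

def agent_step(obs):
    # Hypothetical per-agent update (stand-in for policy inference plus
    # environment logic; not WarpDrive code).
    return obs * 0.99 + 0.01

# Nested vmap: parallelize over agents within each environment, then
# over environments, so one launch advances every agent in every sim.
env_step = jax.vmap(jax.vmap(agent_step))

# Buffer donation lets XLA write the result into the input's GPU memory,
# approximating a single simulation data store updated in place.
@partial(jax.jit, donate_argnums=0)
def step_all(obs):
    return env_step(obs)

# 2000 environments x 1000 agents, the scale quoted for the Tag benchmark.
obs = jnp.zeros((2000, 1000, 8))
obs = step_all(obs)
```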
- BayesSimIG: Scalable Parameter Inference for Adaptive Domain Randomization with IsaacGym [59.53949960353792]
BayesSimIG is a library that provides an implementation of BayesSim integrated with the recently released NVIDIA IsaacGym.
BayesSimIG provides an integration with TensorBoard to easily visualize slices of high-dimensional posteriors.
arXiv Detail & Related papers (2021-07-09T16:21:31Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by using it in end-to-end neural network training with real genomics datasets (a generic sketch of the dilated-convolution operation appears after this list).
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- Large Batch Simulation for Deep Reinforcement Learning [101.01408262583378]
We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work.
We realize end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine.
By combining batch simulation and performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system.
arXiv Detail & Related papers (2021-03-12T00:22:50Z)
- Multi-GPU SNN Simulation with Perfect Static Load Balancing [0.8360870648463651]
We present an SNN simulator which scales to millions of neurons, billions of synapses, and 8 GPUs.
This is made possible by 1) a novel, cache-aware spike transmission algorithm, 2) a model-parallel multi-GPU distribution scheme, and 3) a static, yet very effective, load balancing strategy.
arXiv Detail & Related papers (2021-02-09T07:07:34Z)
- GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON.
We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
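As a companion to the 1D dilated convolution entry above, the following is a minimal JAX sketch of what a dilated 1D convolution computes. It illustrates the operation only, not that paper's AVX-512-optimized CPU implementation; the shapes and dilation factor are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def dilated_conv1d(x, w, dilation):
    # x: (batch, channels_in, width); w: (channels_out, channels_in, k).
    # rhs_dilation inserts (dilation - 1) implicit zeros between kernel
    # taps, enlarging the receptive field without adding weights.
    return jax.lax.conv_general_dilated(
        x, w,
        window_strides=(1,),
        padding="SAME",
        rhs_dilation=(dilation,),
        dimension_numbers=("NCW", "OIW", "NCW"),
    )

kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (8, 4, 1024))  # 8 sequences, 4 input channels
w = jax.random.normal(kw, (16, 4, 3))    # 16 filters, kernel size 3
y = dilated_conv1d(x, w, dilation=4)     # taps span (3 - 1) * 4 + 1 = 9
print(y.shape)                           # (8, 16, 1024)
```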
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.