The Architectural Implications of Distributed Reinforcement Learning on
CPU-GPU Systems
- URL: http://arxiv.org/abs/2012.04210v1
- Date: Tue, 8 Dec 2020 04:50:05 GMT
- Title: The Architectural Implications of Distributed Reinforcement Learning on
CPU-GPU Systems
- Authors: Ahmet Inci, Evgeny Bolotin, Yaosheng Fu, Gal Dalal, Shie Mannor, David
Nellans, Diana Marculescu
- Abstract summary: We show how to improve the performance and power efficiency of RL training on CPU-GPU systems.
We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework.
We also introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With deep reinforcement learning (RL) methods achieving results that exceed
human capabilities in games, robotics, and simulated environments, continued
scaling of RL training is crucial to its deployment in solving complex
real-world problems. However, improving the performance scalability and power
efficiency of RL training through understanding the architectural implications
of CPU-GPU systems remains an open problem. In this work we investigate and
improve the performance and power efficiency of distributed RL training on
CPU-GPU systems by approaching the problem not solely from the GPU
microarchitecture perspective but following a holistic system-level analysis
approach. We quantify the overall hardware utilization on a state-of-the-art
distributed RL training framework and empirically identify the bottlenecks
caused by GPU microarchitectural, algorithmic, and system-level design choices.
We show that the GPU microarchitecture itself is well-balanced for
state-of-the-art RL frameworks, but further investigation reveals that the
number of actors running the environment interactions and the amount of
hardware resources available to them are the primary performance and power
efficiency limiters. To this end, we introduce a new system design metric,
CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU
resources when designing scalable and efficient CPU-GPU systems for RL
training.
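To make the CPU/GPU ratio metric concrete, the sketch below estimates how many CPU sockets are needed to keep one GPU learner fed with environment samples. It is a minimal illustration, not the paper's methodology; the throughput figures and function name are hypothetical placeholders that would come from profiling in practice.

```python
# Hypothetical sketch: balancing CPU actors against a GPU learner.
# All throughput figures are made-up placeholders; in practice they
# would be measured by profiling the RL training framework.
import math

def optimal_cpu_gpu_ratio(samples_per_core_hz: float,
                          samples_consumed_by_gpu_hz: float,
                          cores_per_cpu: int) -> float:
    """Return the CPU/GPU ratio at which environment-sample production
    matches the GPU learner's consumption rate."""
    cores_needed = samples_consumed_by_gpu_hz / samples_per_core_hz
    return cores_needed / cores_per_cpu

# Example: each actor core generates 500 env steps/s, one GPU consumes
# 40,000 steps/s during training, and a CPU socket has 32 cores.
ratio = optimal_cpu_gpu_ratio(500.0, 40_000.0, 32)
print(f"CPU sockets per GPU to avoid starving the learner: {math.ceil(ratio)}")
```

The balance point is reached when aggregate actor sample production matches the learner's consumption rate; provisioning below that ratio starves the GPU, while provisioning above it wastes CPU resources and power.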
Related papers
- Benchmarking Edge AI Platforms for High-Performance ML Inference
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z)
- SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems
Reinforcement Learning (RL) trains agents to learn optimal behavior by maximizing reward signals from experience datasets.
To overcome the memory and data-movement bottlenecks such training incurs, SwiftRL explores Processing-In-Memory (PIM) architectures to accelerate RL workloads.
We achieve near-linear performance scaling by implementing RL algorithms like Tabular Q-learning and SARSA on UPMEM PIM systems.
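The tabular algorithms ported to PIM here are standard; for reference, a minimal Q-learning update (plain NumPy on a CPU, nothing UPMEM-specific; the function name is ours) looks like the sketch below.

```python
# Minimal tabular Q-learning step (standard algorithm; shown here in
# NumPy -- SwiftRL's contribution is running such updates on UPMEM PIM).
import numpy as np

def q_learning_update(Q: np.ndarray, s: int, a: int, r: float,
                      s_next: int, alpha: float = 0.1,
                      gamma: float = 0.99) -> None:
    """In-place Bellman update for one (s, a, r, s') transition."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Toy usage: 5 states, 2 actions.
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```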
arXiv Detail & Related papers (2024-05-07T02:54:31Z)
- Spreeze: High-Throughput Parallel Reinforcement Learning Framework
Spreeze is a lightweight parallel framework for reinforcement learning.
It efficiently utilizes the hardware resources of a single desktop machine to approach the throughput limit.
It can achieve up to 15,000 Hz experience sampling and a 370,000 Hz network update frame rate.
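As a rough illustration of how such sampling rates can be measured, the toy probe below spawns several worker processes that emit placeholder experience tuples and reports the aggregate rate. It is not Spreeze's code, and the numbers it prints are entirely machine-dependent.

```python
# Toy throughput probe: parallel workers produce fake "experience"
# tuples and the parent measures the aggregate sampling rate.
import multiprocessing as mp
import time

def actor(n_steps: int, q) -> None:
    for t in range(n_steps):
        q.put((t, 0.0))  # stand-in for an (observation, reward) tuple

if __name__ == "__main__":
    n_workers, n_steps = 4, 50_000
    q = mp.Queue()
    start = time.perf_counter()
    procs = [mp.Process(target=actor, args=(n_steps, q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    drained = 0
    while drained < n_workers * n_steps:
        q.get()
        drained += 1
    for p in procs:
        p.join()
    rate = drained / (time.perf_counter() - start)
    print(f"~{rate:,.0f} experiences/s")
```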
arXiv Detail & Related papers (2023-12-11T05:25:01Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
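The sketch below is a toy rendering of that two-step recipe in Python: a small block-GEMM stands in for a Tensor Processing Primitive, and a separate loop nest expresses the logical loops around it. The real framework targets optimized CPU kernels; all names here are illustrative.

```python
# Step 1: the computational core ("TPP"); step 2: declarative loops
# around it. A real TPP would be a highly tuned CPU microkernel.
import numpy as np

def tpp_block_gemm(A_blk, B_blk, C_blk):
    """The computational core: C_blk += A_blk @ B_blk."""
    C_blk += A_blk @ B_blk

def blocked_matmul(A, B, blk: int = 64):
    """Logical loops around the TPP, kept separate from the core."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, blk):
        for j in range(0, N, blk):
            for k in range(0, K, blk):
                tpp_block_gemm(A[i:i+blk, k:k+blk],
                               B[k:k+blk, j:j+blk],
                               C[i:i+blk, j:j+blk])
    return C

A, B = np.ones((128, 128)), np.ones((128, 128))
assert np.allclose(blocked_matmul(A, B), A @ B)
```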
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- MSRL: Distributed Reinforcement Learning with Dataflow Fragments
Reinforcement learning (RL) training involves many agents, is resource-intensive, and must scale to large GPU clusters.
We describe MindSpore Reinforcement Learning (MSRL), a distributed RL training system that supports distribution policies that govern how RL training is parallelised and distributed on cluster resources.
MSRL introduces the new abstraction of a fragmented dataflow graph, which maps functions from an RL algorithm's training loop to parallel computational fragments.
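A toy illustration of the fragment abstraction is sketched below: each stage of the training loop becomes a fragment that a scheduler could place on a different device. The class and field names are ours, not MSRL's actual API.

```python
# Toy fragmented dataflow graph: each RL training-loop stage becomes a
# fragment with a device placement. Illustrative only, not MSRL's API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Fragment:
    name: str
    fn: Callable          # the piece of the training loop it wraps
    device: str = "cpu"   # placement decided by a distribution policy

@dataclass
class FragmentedGraph:
    fragments: List[Fragment] = field(default_factory=list)

    def run_iteration(self, state):
        # Sequential reference execution; a real system would run
        # fragments in parallel across cluster resources.
        for frag in self.fragments:
            state = frag.fn(state)
        return state

graph = FragmentedGraph([
    Fragment("act",    lambda s: s + ["trajectory"], device="cpu"),
    Fragment("learn",  lambda s: s + ["grad_step"],  device="gpu"),
    Fragment("update", lambda s: s + ["new_policy"], device="cpu"),
])
print(graph.run_iteration([]))
```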
arXiv Detail & Related papers (2022-10-03T12:34:58Z)
- Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers
We introduce a deep reinforcement learning architecture whose purpose is to increase sample efficiency without sacrificing performance.
We propose a visually attentive model that uses transformers to learn a self-attention mechanism on the feature maps of the state representation.
We demonstrate empirically that this architecture improves sample complexity for several Atari environments, while also achieving better performance in some of the games.
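As a reference for the mechanism, the sketch below applies single-head self-attention to a flattened CNN feature map in NumPy. The shapes, single-head design, and initialization are assumptions for illustration, not the paper's exact architecture.

```python
# Single-head self-attention over a flattened (H*W, C) feature map.
import numpy as np

def self_attention_on_features(fmap: np.ndarray, d_k: int = 16):
    """fmap: (H*W, C) feature map; returns attended features."""
    rng = np.random.default_rng(0)
    C = fmap.shape[1]
    W_q, W_k, W_v = (rng.standard_normal((C, d_k)) * 0.02
                     for _ in range(3))
    Q, K, V = fmap @ W_q, fmap @ W_k, fmap @ W_v
    scores = Q @ K.T / np.sqrt(d_k)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # row-wise softmax
    return attn @ V                            # (H*W, d_k)

features = np.random.default_rng(1).standard_normal((7 * 7, 64))
print(self_attention_on_features(features).shape)  # (49, 16)
```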
arXiv Detail & Related papers (2022-02-01T19:03:03Z)
- JUWELS Booster -- A Supercomputer for Large-Scale AI Research
We present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center.
We detail its system architecture, its parallel and distributed model training capabilities, and benchmarks indicating its outstanding performance.
arXiv Detail & Related papers (2021-06-30T21:37:02Z)
- Off-Policy Reinforcement Learning for Efficient and Effective GAN Architecture Search
We introduce a new reinforcement learning based neural architecture search (NAS) methodology for generative adversarial network (GAN) architecture search.
The key idea is to formulate the GAN architecture search problem as a Markov decision process (MDP) for smoother architecture sampling.
We exploit an off-policy GAN architecture search algorithm that makes efficient use of the samples generated by previous policies.
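A toy sketch of the MDP formulation appears below: the state is the partially built generator, each action appends a layer choice, and a stand-in reward scores the finished architecture. The action set and scorer are invented for illustration and do not reflect the paper's search space.

```python
# Architecture search as an MDP: states are partial architectures,
# actions append layers, reward arrives at episode end. All names and
# the scoring function are illustrative stand-ins.
import random

ACTIONS = ["conv3x3", "conv5x5", "upsample", "skip"]

def fake_score(arch: list) -> float:
    # Stand-in for training the GAN and measuring IS/FID.
    return sum(1.0 for a in arch if a != "skip") + random.random()

def step(state: list, action: str):
    """Append one layer; the episode ends after 4 decisions."""
    next_state = state + [action]
    done = len(next_state) == 4
    reward = fake_score(next_state) if done else 0.0
    return next_state, reward, done

state, reward, done = [], 0.0, False
while not done:
    action = random.choice(ACTIONS)   # a learned policy would go here
    state, reward, done = step(state, action)
print(state, reward)
```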
arXiv Detail & Related papers (2020-07-17T18:29:17Z)
- Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than existing approaches on several real datasets.
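One simple load-balancing idea behind such heterogeneous training, sketched below under assumed throughput numbers, is to split each minibatch between devices in proportion to their measured processing rates so that neither sits idle.

```python
# Proportional minibatch split between CPU and GPU workers. The
# throughput rates are hypothetical; in practice they are profiled.
def split_batch(batch_size: int, cpu_rate: float, gpu_rate: float):
    """Return (cpu_share, gpu_share) proportional to device throughput."""
    total = cpu_rate + gpu_rate
    cpu_share = round(batch_size * cpu_rate / total)
    return cpu_share, batch_size - cpu_share

# Example: the GPU processes samples 8x faster than the CPU.
print(split_batch(512, cpu_rate=1.0, gpu_rate=8.0))  # (57, 455)
```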
arXiv Detail & Related papers (2020-04-19T05:21:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.