Accelerating Deep Neuroevolution on Distributed FPGAs for Reinforcement Learning Problems
- URL: http://arxiv.org/abs/2005.04536v1
- Date: Sun, 10 May 2020 00:41:39 GMT
- Title: Accelerating Deep Neuroevolution on Distributed FPGAs for Reinforcement Learning Problems
- Authors: Alexis Asseman, Nicolas Antoine and Ahmet S. Ozcan
- Abstract summary: We report record training times (running at about 1 million frames per second) for Atari 2600 games using deep neuroevolution implemented on distributed FPGAs.
These results are the first application demonstration on the IBM Neural Computer.
- Score: 0.7734726150561088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning, augmented by the representational power of deep neural networks, has shown promising results on high-dimensional problems such as game playing and robotic control. However, the sequential nature of these problems poses a fundamental challenge for computational efficiency. Recently, alternative approaches such as evolutionary strategies and deep neuroevolution have demonstrated competitive results with faster training times on distributed CPU cores. Here, we report record training times (about 1 million frames per second) for Atari 2600 games using deep neuroevolution implemented on distributed FPGAs. The acceleration comes from a combined hardware implementation of the game console, image pre-processing and the neural network in an optimized pipeline, multiplied by system-level parallelism. These results are the first application demonstration on the IBM Neural Computer, a custom-designed system consisting of 432 Xilinx FPGAs interconnected in a 3D mesh network topology. In addition to high performance, experiments also showed improved accuracy on all games compared to the CPU implementation of the same algorithm.
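To make the method concrete, below is a minimal sketch of the gradient-free, genetic-algorithm flavor of deep neuroevolution the abstract refers to: each generation, elite parameter vectors are perturbed with Gaussian noise, children are evaluated, and the best survive. The population size, mutation scale, and the dummy evaluate function are illustrative assumptions; in the paper, evaluation runs the Atari console, pre-processing and network pipeline directly in FPGA hardware, in parallel across the machine.

```python
import numpy as np

def evaluate(params):
    # Hypothetical fitness: in the paper this would play one Atari episode
    # with a policy network parameterized by `params` and return the score.
    return -float(np.sum(params ** 2))  # dummy objective so the sketch runs

def neuroevolution(n_params=1000, pop_size=64, n_elite=8,
                   sigma=0.02, generations=100, seed=0):
    rng = np.random.default_rng(seed)
    parents = [rng.normal(0.0, 1.0, n_params) for _ in range(n_elite)]
    for _ in range(generations):
        # Children: a random elite parent plus Gaussian parameter noise.
        children = [parents[rng.integers(n_elite)]
                    + sigma * rng.normal(0.0, 1.0, n_params)
                    for _ in range(pop_size)]
        scores = [evaluate(c) for c in children]  # parallel across FPGAs in the paper
        ranked = np.argsort(scores)[::-1]         # higher score is better
        parents = [children[i] for i in ranked[:n_elite]]
    return parents[0]

best = neuroevolution()
```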
Related papers
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
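As a concrete hint of what the snnTorch package provides, here is a minimal leaky integrate-and-fire simulation loop using its documented Leaky neuron; the layer sizes, time window and random input are arbitrary placeholders, none of the IPU-specific optimizations from the paper are shown, and exact arguments may vary between snnTorch versions.

```python
import torch
import torch.nn as nn
import snntorch as snn

fc = nn.Linear(784, 10)       # simple readout layer (sizes are arbitrary)
lif = snn.Leaky(beta=0.9)     # leaky integrate-and-fire neuron
mem = lif.init_leaky()        # initial membrane potential

x = torch.rand(25, 784)       # 25 time steps of input (placeholder data)
spikes = []
for step in range(x.size(0)):
    cur = fc(x[step])         # synaptic current at this time step
    spk, mem = lif(cur, mem)  # spike output and updated membrane state
    spikes.append(spk)

rate_code = torch.stack(spikes).sum(0)  # rate-coded output over the window
```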
- GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network-based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction (GLEAM), an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
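A hedged sketch of the greedy-training idea behind GLEAM: each unrolled stage is optimized with its own local loss and a detached input, so gradients (and activations) never have to flow across the whole unroll at once. The stage definitions, loss, and data below are illustrative stand-ins, not the paper's actual modules or data-consistency terms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Four unrolled stages, each with its own optimizer (illustrative modules).
stages = [nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 1, 3, padding=1))
          for _ in range(4)]
opts = [torch.optim.Adam(s.parameters(), lr=1e-4) for s in stages]

def greedy_step(x, target):
    for stage, opt in zip(stages, opts):
        x = x.detach()                # block gradients into earlier stages
        x = stage(x)
        loss = F.mse_loss(x, target)  # local loss proxy for this stage
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x

x = torch.rand(1, 1, 32, 32)       # undersampled input (placeholder)
target = torch.rand(1, 1, 32, 32)  # fully-sampled reference (placeholder)
greedy_step(x, target)
```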
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- N3H-Core: Neuron-designed Neural Network Accelerator via FPGA-based Heterogeneous Computing Cores [26.38812379700231]
We develop an FPGA-based heterogeneous computing system for neural network acceleration.
The proposed accelerator consists of DSP- and LUT-based GEneral Matrix-Multiplication (GEMM) computing cores.
Our design outperforms the state-of-the-art Mix&Match design, reducing latency by 1.12-1.32x while achieving higher inference accuracy.
arXiv Detail & Related papers (2021-12-15T15:12:00Z)
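To picture the heterogeneous-GEMM idea, here is a toy split of a matrix multiplication's output rows between two compute cores in proportion to an assumed throughput ratio; the numpy stand-ins and the 70/30 split are purely illustrative and not taken from the paper.

```python
import numpy as np

def split_gemm(A, B, dsp_share=0.7):
    # Toy heterogeneous GEMM: a fraction of output rows goes to the
    # "DSP core", the rest to the "LUT core" (both emulated with numpy).
    # dsp_share is an assumed relative throughput, not a measured value.
    cut = int(A.shape[0] * dsp_share)
    C_dsp = A[:cut] @ B   # rows assigned to the DSP-based core
    C_lut = A[cut:] @ B   # rows assigned to the LUT-based core
    return np.vstack([C_dsp, C_lut])

A = np.random.rand(8, 4)
B = np.random.rand(4, 5)
assert np.allclose(split_gemm(A, B), A @ B)  # same result as a plain GEMM
```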
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires substantial computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
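A minimal sketch of the neighborhood sampling the entry above describes: for each mini-batch seed node, sample up to a fixed fanout of neighbors per hop from an adjacency list. The two-hop setting and plain-Python data structures are illustrative; the paper's sampler is performance-engineered far beyond this.

```python
import random

def sample_neighborhood(adj, seeds, fanouts, rng=random.Random(0)):
    # adj: dict node -> list of neighbors; seeds: mini-batch nodes;
    # fanouts: neighbors to sample per hop, e.g. [10, 5] for two hops.
    layers = [list(seeds)]
    frontier = set(seeds)
    for fanout in fanouts:
        nxt = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            nxt.update(rng.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(sorted(nxt))
        frontier = nxt
    return layers  # the nodes needed at each hop for this mini-batch

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sample_neighborhood(adj, seeds=[0], fanouts=[2, 2]))
```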
- Overview of FPGA deep learning acceleration based on convolutional neural network [0.76146285961466]
Deep learning has matured considerably in recent years, and convolutional neural networks, one of its most widely used models, are applied across a broad range of visual tasks.
This review introduces the theories and algorithms related to convolution, summarizes the application scenarios of several existing FPGA technologies based on convolutional neural networks, and focuses on the application of accelerators.
arXiv Detail & Related papers (2020-12-23T12:44:24Z)
- Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks [0.17499351967216337]
We provide a machine-learning-based method, PerfNetV2, which improves on the accuracy of our previous work for modeling neural network performance on a variety of GPU accelerators.
Given an application, the proposed method can be used to predict the inference time and training time of the convolutional neural networks used in the application.
Our case studies show that PerfNetV2 yields a mean absolute percentage error within 13.1% on LeNet, AlexNet, and VGG16 on an NVIDIA GTX-1080Ti, while the error of a previous method published at ICBD 2018 could be as large as 200%.
arXiv Detail & Related papers (2020-12-01T01:42:23Z)
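As a generic illustration of platform-aware performance modeling (not PerfNetV2's actual learned model), the sketch below fits a least-squares regression from simple per-layer features such as FLOPs and memory traffic to measured runtime; all feature choices and numbers are invented.

```python
import numpy as np

# Toy training data: per-layer features -> measured runtime (ms).
# Features: [GFLOPs, input MB, output MB]; all values are invented.
X = np.array([[0.2, 1.5, 1.5],
              [1.1, 3.0, 3.0],
              [2.3, 6.1, 6.1],
              [4.0, 6.1, 3.0]])
y = np.array([0.8, 2.1, 4.4, 6.9])  # hypothetical measurements

Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # least-squares fit

def predict_runtime_ms(gflops, in_mb, out_mb):
    return float(np.array([gflops, in_mb, out_mb, 1.0]) @ w)

print(f"predicted: {predict_runtime_ms(3.0, 5.0, 5.0):.2f} ms")
```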
- Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve a 28-78% speed-up over the native NNP-I compiler on all three evaluated workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z)
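The memory-placement search can be pictured as evolution over discrete assignments. The sketch below mutates tensor-to-memory mappings against a made-up cost function; it omits the graph-RL policy EGRL layers on top, and the memory levels, costs and capacities are invented for illustration.

```python
import random

MEMORIES = ["DRAM", "SRAM", "scratchpad"]  # illustrative memory levels

def cost(placement):
    # Made-up cost model: SRAM is fast but scarce.
    sram_used = sum(1 for m in placement if m == "SRAM")
    capacity_penalty = 10 * max(0, sram_used - 2)
    latency = sum({"DRAM": 5, "SRAM": 1, "scratchpad": 2}[m] for m in placement)
    return latency + capacity_penalty

def evolve(n_tensors=6, pop=20, gens=50, rng=random.Random(0)):
    population = [[rng.choice(MEMORIES) for _ in range(n_tensors)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=cost)          # lower cost is better
        survivors = population[: pop // 4]
        population = [s[:] for s in survivors]
        while len(population) < pop:
            child = rng.choice(survivors)[:]
            child[rng.randrange(n_tensors)] = rng.choice(MEMORIES)  # mutation
            population.append(child)
    return min(population, key=cost)

print(evolve())
```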
- Benchmarking Deep Spiking Neural Networks on Neuromorphic Hardware [0.0]
We use the methodology of converting pre-trained non-spiking neural networks to spiking neural networks to evaluate the performance loss and measure the energy per inference.
We demonstrate that the conversion loss is usually below one percent for digital implementations, and moderately higher for analog systems, which offer the benefit of much lower energy-per-inference costs.
arXiv Detail & Related papers (2020-04-03T16:25:49Z)
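The conversion methodology rests on the observation that the firing rate of an integrate-and-fire neuron driven by a constant input approximates a ReLU activation. The toy simulation below checks that correspondence for a single neuron; the threshold, time window, and reset-by-subtraction variant are arbitrary choices for illustration.

```python
def if_firing_rate(current, threshold=1.0, steps=1000):
    # Integrate-and-fire neuron with a constant input current and
    # reset-by-subtraction; returns spikes per time step.
    v, spikes = 0.0, 0
    for _ in range(steps):
        v += current
        if v >= threshold:
            v -= threshold
            spikes += 1
    return spikes / steps

for current in (-0.2, 0.1, 0.4, 0.8):
    relu = max(0.0, current)  # the activation the rate should approximate
    print(f"input {current:+.1f}: rate {if_firing_rate(current):.3f} "
          f"vs ReLU {relu:.3f}")
```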
- GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms [1.2183405753834562]
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs.
It is challenging to accelerate training of GCNs due to substantial and irregular data communication.
We design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems.
arXiv Detail & Related papers (2019-12-31T21:19:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.