LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight
Grouping for Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2210.16624v1
- Date: Sat, 29 Oct 2022 15:09:34 GMT
- Title: LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight
Grouping for Multi-Agent Reinforcement Learning
- Authors: Je Yang, JaeUk Kim and Joo-Young Kim
- Abstract summary: Multi-agent reinforcement learning (MARL) is a powerful technology for constructing interactive artificial intelligence systems.
We present a real-time sparse training acceleration system named LearningGroup.
Our system reduces the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively.
- Score: 2.0625936401496237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent reinforcement learning (MARL) is a powerful technology for
constructing interactive artificial intelligence systems in various applications
such as multi-robot control and self-driving cars. Unlike supervised learning or
single-agent reinforcement learning, which actively exploit network pruning, it is
unclear how pruning will work in multi-agent reinforcement learning, with its
cooperative and interactive characteristics. In this paper, we
present a real-time sparse training acceleration system named LearningGroup,
which adopts network pruning on the training of MARL for the first time with an
algorithm/architecture co-design approach. We create sparsity using a weight
grouping algorithm and propose on-chip sparse data encoding loop (OSEL) that
enables fast encoding with efficient implementation. Based on the OSEL's
encoding format, LearningGroup performs efficient weight compression and
computation workload allocation to multiple cores, where each core handles
multiple sparse rows of the weight matrix simultaneously with vector processing
units. As a result, the LearningGroup system reduces the cycle time and memory
footprint for sparse data generation by up to 5.72x and 6.81x, respectively. Its FPGA
accelerator delivers 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W
energy efficiency across various MARL conditions, 7.13x higher throughput and
12.43x better energy efficiency than an Nvidia Titan RTX GPU, thanks to fully
on-chip training and the highly optimized dataflow/data format enabled by the FPGA.
Most importantly, the accelerator achieves a speedup of up to 12.52x when processing
sparse data over the dense case, the highest among state-of-the-art
sparse training accelerators.
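The paper does not include code, but the pipeline it describes (group-wise pruning, on-chip sparse encoding, then per-core row allocation) can be illustrated with a minimal Python sketch. All function names below are hypothetical: the real weight grouping is learnable, and OSEL produces a hardware-specific format rather than textbook CSR.

```python
import numpy as np

def prune_by_weight_grouping(W, keep_ratio=0.5, num_groups=4):
    """Toy stand-in for the paper's learnable weight grouping: rows are
    split into fixed groups and the smallest-magnitude weights within
    each group are zeroed (the real algorithm learns the grouping)."""
    W = W.copy()
    rows_per_group = max(1, W.shape[0] // num_groups)
    for g in range(0, W.shape[0], rows_per_group):
        block = W[g:g + rows_per_group]
        n_zero = int(block.size * (1 - keep_ratio))
        if n_zero > 0:
            thresh = np.partition(np.abs(block).ravel(), n_zero - 1)[n_zero - 1]
            block[np.abs(block) <= thresh] = 0.0  # writes through the view
    return W

def encode_sparse_rows(W):
    """CSR-style (values, column indices, row pointers) encoding,
    loosely analogous in spirit to what OSEL builds on-chip."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def assign_rows_to_cores(row_ptr, num_cores=4):
    """Greedy allocation: each sparse row goes to the currently
    least-loaded core, balancing nonzero counts across cores."""
    loads = [0] * num_cores
    assignment = []
    for r in range(len(row_ptr) - 1):
        nnz = row_ptr[r + 1] - row_ptr[r]
        core = min(range(num_cores), key=loads.__getitem__)
        assignment.append(core)
        loads[core] += nnz
    return assignment, loads

W = prune_by_weight_grouping(np.random.randn(16, 32))
vals, cols, ptr = encode_sparse_rows(W)
cores, loads = assign_rows_to_cores(ptr)
print(f"{len(vals)} nonzeros, per-core loads: {loads}")
```

The greedy least-loaded assignment mirrors the motivation for giving each core multiple sparse rows: balancing nonzero counts keeps the vector processing units busy even when sparsity is irregular.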
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators [0.0]
Deep Neural Networks (DNNs) are being developed, trained, and utilized, putting a strain on both advanced and limited devices.
Our solution is to implement weight block sparsity, a structured form of sparsity that is hardware-friendly (a toy sketch follows this entry).
We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with ResNet50, Inception V3, and VGG16.
arXiv Detail & Related papers (2024-07-12T17:37:49Z)
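The weight-block-sparsity entry above describes structured pruning at block granularity. As a rough, hypothetical illustration (the paper's actual training and compilation flow for AIE2 is far more involved), block pruning by L1 norm might look like:

```python
import numpy as np

def block_sparsify(W, block=(8, 8), keep_ratio=0.25):
    """Zero out whole blocks with the smallest L1 norm, keeping the top
    `keep_ratio` fraction. Hardware-friendly because surviving blocks
    stay dense and can align with the accelerator's tile size."""
    bh, bw = block
    H, Wd = W.shape
    assert H % bh == 0 and Wd % bw == 0, "pad W to a multiple of the block size"
    blocks = W.reshape(H // bh, bh, Wd // bw, bw)
    norms = np.abs(blocks).sum(axis=(1, 3))        # L1 norm per block
    k = max(1, int(norms.size * keep_ratio))
    thresh = np.partition(norms.ravel(), -k)[-k]   # k-th largest norm
    mask = (norms >= thresh)[:, None, :, None]     # broadcast over blocks
    return (blocks * mask).reshape(H, Wd)

W = block_sparsify(np.random.randn(32, 64))
print(f"density: {np.count_nonzero(W) / W.size:.2f}")
```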
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- RedMulE: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration [15.869673535117032]
Current training algorithms rely on floating-point matrix operations to meet the precision and dynamic range requirements.
RedMulE is a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops) acceleration.
RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements (a toy mixed-precision GEMM follows this entry).
arXiv Detail & Related papers (2023-01-10T11:07:16Z)
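RedMulE's contribution is a hardware GEMM datapath, but the mixed-precision numerics it targets (e.g., FP16 inputs with wider accumulation) can be emulated in a few lines. This sketch is only an assumption about the precision policy, not the paper's design:

```python
import numpy as np

def gemm_fp16_in_fp32_acc(A, B):
    """Quantize inputs to FP16, then accumulate the dot products in
    FP32 so long reductions neither saturate nor lose too much range.
    This emulates a common precision policy in software only."""
    A16 = A.astype(np.float16)  # inputs stored/streamed as FP16
    B16 = B.astype(np.float16)
    return A16.astype(np.float32) @ B16.astype(np.float32)

A = np.random.randn(64, 64).astype(np.float32)
B = np.random.randn(64, 64).astype(np.float32)
err = np.abs(gemm_fp16_in_fp32_acc(A, B) - A @ B).max()
print(f"max |error| vs. FP32 reference: {err:.4f}")
```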
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch (basic usage sketched below).
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
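snnTorch is a real, publicly available package; the entry's IPU-optimized release is not shown here, but the standard usage pattern of a leaky integrate-and-fire layer looks like this (shapes and constants are arbitrary):

```python
import torch
import snntorch as snn

# One leaky integrate-and-fire (LIF) layer; beta is the per-timestep
# membrane decay. Arbitrary toy sizes: batch of 1, 8 neurons, 10 steps.
lif = snn.Leaky(beta=0.9)
mem = lif.init_leaky()  # initial membrane potential

spikes = []
for _ in range(10):
    cur_in = torch.rand(1, 8)    # toy input current
    spk, mem = lif(cur_in, mem)  # integrate, fire if threshold crossed
    spikes.append(spk)

print(int(torch.stack(spikes).sum().item()), "spikes over 10 timesteps")
```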
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a non-uniform quantization methodology.
Results show that our method achieves better performance with lower power consumption (a toy non-uniform quantizer is sketched below).
arXiv Detail & Related papers (2021-10-10T17:31:27Z)
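The entry above pairs pipelined matrix multiplication with non-uniform quantization. One common non-uniform scheme, possibly different from the paper's, snaps weights to signed powers of two so that multiplies become bit shifts in hardware; a toy per-tensor version:

```python
import numpy as np

def log2_quantize(x, bits=4):
    """Snap magnitudes to the nearest power of two within a window of
    2**(bits-1) - 1 exponents below the tensor's largest exponent;
    values below the window quantize to zero."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 1e-12, None)
    exp = np.round(np.log2(mag))
    hi = exp.max()
    lo = hi - (2 ** (bits - 1) - 1)          # representable exponent range
    q = sign * 2.0 ** np.clip(exp, lo, hi)
    q[np.abs(x) < 2.0 ** (lo - 1)] = 0.0     # underflow snaps to zero
    return q

w = np.random.randn(4, 4)
print(np.round(log2_quantize(w), 4))
```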
- Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks [13.552262050816616]
Kronecker-Factored Approximate Curvature (K-FAC) is one of the most efficient approximation algorithms for training deep models.
Yet, when leveraging GPU clusters to train models with K-FAC, it incurs extensive computation and introduces extra communication in each iteration.
We propose D-KFAC with smart parallelism of computing and communication tasks to reduce the iteration time (a minimal K-FAC preconditioning step is sketched below).
arXiv Detail & Related papers (2021-07-14T08:01:07Z)
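For context on what D-KFAC parallelizes: a single-layer K-FAC preconditioning step inverts two small Kronecker factors instead of the full Fisher matrix. This is the textbook update, not the paper's distributed scheduler:

```python
import numpy as np

def kfac_update(grad_W, acts, grads_out, damping=1e-3):
    """K-FAC preconditioning for a dense layer with weight (out, in).
    The Fisher is approximated by a Kronecker product of A = E[a aT]
    (layer inputs) and G = E[g gT] (output gradients), so the inverse
    reduces to two small, damped matrix inverses."""
    n = acts.shape[0]
    A = acts.T @ acts / n                    # (in, in)
    G = grads_out.T @ grads_out / n          # (out, out)
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # Kronecker identity: (G kron A)^-1 vec(grad) <=> G^-1 @ grad_W @ A^-1
    return G_inv @ grad_W @ A_inv

n, d_in, d_out = 256, 32, 16
acts = np.random.randn(n, d_in)
grads_out = np.random.randn(n, d_out)
grad_W = grads_out.T @ acts / n              # (out, in) average gradient
print(kfac_update(grad_W, acts, grads_out).shape)
```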
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.