Harnessing Manycore Processors with Distributed Memory for Accelerated
Training of Sparse and Recurrent Models
- URL: http://arxiv.org/abs/2311.04386v1
- Date: Tue, 7 Nov 2023 23:18:35 GMT
- Title: Harnessing Manycore Processors with Distributed Memory for Accelerated
Training of Sparse and Recurrent Models
- Authors: Jan Finkbeiner, Thomas Gmeinder, Mark Pupilli, Alexander Titterton,
Emre Neftci
- Abstract summary: Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
- Score: 43.1773057439246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current AI training infrastructure is dominated by single instruction
multiple data (SIMD) and systolic array architectures, such as Graphics
Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at
accelerating parallel workloads and dense vector matrix multiplications.
Potentially more efficient neural network models utilizing sparsity and
recurrence cannot leverage the full power of SIMD processors and are thus at a
severe disadvantage compared to today's prominent, highly parallelizable
architectures such as Transformers and CNNs, thereby hindering the path towards more sustainable AI.
To overcome this limitation, we explore sparse and recurrent model training on
a massively parallel multiple instruction multiple data (MIMD) architecture
with distributed local memory. We implement a training routine based on
backpropagation through time (BPTT) for the brain-inspired class of Spiking
Neural Networks (SNNs) that feature binary sparse activations. We observe a
massive advantage in using sparse activation tensors with a MIMD processor, the
Intelligence Processing Unit (IPU), compared to GPUs. On training workloads, our
results demonstrate 5-10x throughput gains compared to A100 GPUs and up to 38x
gains for higher levels of activation sparsity, without a significant slowdown
in training convergence or reduction in final model performance. Furthermore,
our results show highly promising trends for both single- and multi-IPU
configurations as we scale up to larger model sizes. Our work paves the way
towards more efficient, non-standard models via AI training hardware beyond
GPUs, and competitive large scale SNN models.
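As background for the training routine described in the abstract, the following is a minimal PyTorch sketch of BPTT through a leaky integrate-and-fire layer with binary (and typically sparse) activations, using a surrogate gradient for the non-differentiable spike. It illustrates the general technique only; the paper's IPU implementation, model sizes, and sparsity handling are not shown, and the names below (SpikeFn, LIFLayer) are illustrative.
```python
import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a surrogate gradient so BPTT can flow through it."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0.0).float()                       # binary, typically sparse activations

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out / (1.0 + 10.0 * v.abs()) ** 2  # smooth surrogate derivative

class LIFLayer(nn.Module):
    """Leaky integrate-and-fire layer unrolled over time for BPTT."""
    def __init__(self, n_in, n_out, beta=0.9, threshold=1.0):
        super().__init__()
        self.fc = nn.Linear(n_in, n_out)
        self.beta, self.threshold = beta, threshold

    def forward(self, x_seq):                          # x_seq: [T, batch, n_in]
        mem = torch.zeros(x_seq.shape[1], self.fc.out_features)
        spikes = []
        for x_t in x_seq:                              # unroll over T time steps
            mem = self.beta * mem + self.fc(x_t)
            spk = SpikeFn.apply(mem - self.threshold)
            mem = mem - spk * self.threshold           # soft reset after a spike
            spikes.append(spk)
        return torch.stack(spikes)                     # [T, batch, n_out]

# Toy BPTT step on random data (placeholder for a real dataset and loss).
T, batch, n_in, n_out = 20, 8, 100, 10
layer = LIFLayer(n_in, n_out)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
x = (torch.rand(T, batch, n_in) < 0.1).float()         # sparse binary input spikes
target = torch.randint(0, n_out, (batch,))
logits = layer(x).sum(dim=0)                           # spike counts as class scores
loss = nn.functional.cross_entropy(logits, target)
opt.zero_grad(); loss.backward(); opt.step()           # backpropagation through time
```
The sparsity of the spike tensors produced here is what the paper exploits on the MIMD/IPU side; on SIMD hardware the same tensors are typically processed densely.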
Related papers
- Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z)
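For illustration of the kind of latency/throughput comparison the entry above describes, here is a toy CPU-only timing harness for a dense matmul. It is not the paper's benchmark suite, and benchmark_matmul is a hypothetical helper.
```python
import time
import numpy as np

def benchmark_matmul(n=1024, repeats=20):
    """Time a dense matmul and report median latency and throughput (GFLOP/s)."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b                                    # warm-up run
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        times.append(time.perf_counter() - t0)
    latency = float(np.median(times))
    gflops = 2 * n**3 / latency / 1e9        # ~2*n^3 FLOPs for an n x n matmul
    return latency, gflops

lat, gflops = benchmark_matmul()
print(f"median latency: {lat*1e3:.2f} ms, throughput: {gflops:.1f} GFLOP/s")
```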
- Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators [0.0]
Deep Neural Networks (DNNs) are being developed, trained, and deployed, straining both high-end and resource-limited devices.
Our solution is to implement weight block sparsity, a structured form of sparsity that is hardware-friendly.
We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with Resnet50, Inception V3, and VGG16.
arXiv Detail & Related papers (2024-07-12T17:37:49Z)
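As a generic illustration of block-structured (hardware-friendly) sparsity, the sketch below zeroes entire tiles of a weight matrix based on their L2 norm. It is not the paper's training or AIE2 code-generation pipeline; block_sparsify and its defaults are illustrative choices.
```python
import torch

def block_sparsify(weight: torch.Tensor, block=(8, 8), keep_ratio=0.25):
    """Zero out entire blocks of a 2-D weight matrix, keeping the blocks with the
    largest L2 norm. Block sparsity keeps the surviving non-zeros in contiguous
    tiles, which maps well onto tiled accelerators."""
    rows, cols = weight.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "pad the matrix to a block multiple"
    # View the matrix as a grid of (rows/br) x (cols/bc) tiles and score each tile.
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    scores = blocks.pow(2).sum(dim=(1, 3)).sqrt()           # L2 norm per tile
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    mask = (scores >= threshold).float()[:, None, :, None]  # broadcast over tiles
    return (blocks * mask).reshape(rows, cols)

w = torch.randn(64, 128)
w_sparse = block_sparsify(w, block=(8, 8), keep_ratio=0.25)
print(f"density: {(w_sparse != 0).float().mean():.2f}")     # roughly 0.25
```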
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training large models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
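A minimal sketch of the usual snnTorch training-loop pattern (Leaky neurons unrolled over time so BPTT applies). The exact snnTorch API can vary between versions, and this is not the IPU-optimized release described in the entry above.
```python
import torch
import torch.nn as nn
import snntorch as snn

# Two-layer spiking network: linear projections feed leaky integrate-and-fire neurons.
fc1, lif1 = nn.Linear(784, 256), snn.Leaky(beta=0.9)
fc2, lif2 = nn.Linear(256, 10), snn.Leaky(beta=0.9)

def forward(x_seq):                        # x_seq: [T, batch, 784] spike trains
    mem1, mem2 = lif1.init_leaky(), lif2.init_leaky()
    out_spikes = []
    for x_t in x_seq:                      # unrolled over time, so BPTT applies
        spk1, mem1 = lif1(fc1(x_t), mem1)
        spk2, mem2 = lif2(fc2(spk1), mem2)
        out_spikes.append(spk2)
    return torch.stack(out_spikes).sum(0)  # spike counts per output class

x = (torch.rand(25, 16, 784) < 0.1).float()
print(forward(x).shape)                    # torch.Size([16, 10])
```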
- GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
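A generic sketch of greedy, block-wise training in the spirit of the entry above: each refinement block gets its own loss and optimizer, and inputs are detached between blocks so activations and gradients never span the whole unrolled network. The MRI physics/data-consistency operators of the actual method are omitted; the toy blocks and losses are illustrative.
```python
import torch
import torch.nn as nn

# A toy "unrolled" network: K refinement blocks that each update an estimate x.
K, dim = 4, 64
blocks = [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
          for _ in range(K)]
opts = [torch.optim.Adam(b.parameters(), lr=1e-3) for b in blocks]

def greedy_step(x, target):
    """Greedy training: each block is updated with its own loss, and its input is
    detached, so no activations or gradients are kept across blocks."""
    for block, opt in zip(blocks, opts):
        x = x.detach()                     # cut the graph: memory stays per-block
        x = x + block(x)                   # residual refinement of the estimate
        loss = nn.functional.mse_loss(x, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x

x0 = torch.randn(32, dim)
target = torch.randn(32, dim)
out = greedy_step(x0, target)
```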
- MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [32.278096820269816]
We present a novel MoESys that boosts efficiency in both large-scale training and inference.
Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
arXiv Detail & Related papers (2022-05-20T09:09:27Z)
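As background for the entry above (which is about the surrounding training/inference system rather than the layer itself), here is a minimal sketch of the top-k expert routing that a mixture-of-experts layer performs. TinyMoE and its sizes are illustrative, and none of MoESys's elastic training, prefetching, or CPU-GPU ring machinery is shown.
```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts layer with top-k gating. Each token is routed to
    its k highest-scoring experts and the outputs are combined with renormalised
    gate weights."""
    def __init__(self, dim=64, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                              # x: [tokens, dim]
        scores = self.gate(x).softmax(dim=-1)          # [tokens, n_experts]
        top_w, top_i = scores.topk(self.k, dim=-1)     # choose k experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):      # dense loop; real systems
            mask = (top_i == e)                        # dispatch tokens per expert
            if mask.any():
                tok, slot = mask.nonzero(as_tuple=True)
                out[tok] += top_w[tok, slot, None] * expert(x[tok])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)                  # torch.Size([10, 64])
```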
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
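A minimal, pure-Python sketch of per-hop neighborhood sampling for mini-batch GNN training: each hop keeps at most a fixed fanout of neighbors per frontier node. It illustrates the general idea only, not the performance-engineered sampler or pipelining from the entry above; sample_neighborhood is a hypothetical helper.
```python
import random

def sample_neighborhood(adj, seeds, fanouts):
    """Sample a multi-hop neighborhood for a mini-batch of seed nodes.
    adj: dict node -> list of neighbors; fanouts: neighbors to keep per hop.
    Returns the per-hop node frontiers, from the seeds outwards."""
    frontiers = [list(seeds)]
    visited = set(seeds)
    for fanout in fanouts:                       # one sampling round per GNN layer
        next_frontier = []
        for node in frontiers[-1]:
            neighbors = adj.get(node, [])
            picked = random.sample(neighbors, min(fanout, len(neighbors)))
            for n in picked:
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontiers.append(next_frontier)
    return frontiers

# Toy graph: a 0-1-2-3 chain plus a hub node 4 connected to everyone.
adj = {0: [1, 4], 1: [0, 2, 4], 2: [1, 3, 4], 3: [2, 4], 4: [0, 1, 2, 3]}
print(sample_neighborhood(adj, seeds=[0], fanouts=[2, 2]))
```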
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with prohibitively high memory footprints.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.