DeepPCR: Parallelizing Sequential Operations in Neural Networks
- URL: http://arxiv.org/abs/2309.16318v2
- Date: Fri, 27 Oct 2023 11:51:52 GMT
- Title: DeepPCR: Parallelizing Sequential Operations in Neural Networks
- Authors: Federico Danieli, Miguel Sarabia, Xavier Suau, Pau Rodríguez, Luca Zappella
- Abstract summary: We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks.
DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm.
To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
- Score: 4.241834259165193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parallelization techniques have become ubiquitous for accelerating inference
and training of deep neural networks. Despite this, several operations are
still performed in a sequential manner. For instance, the forward and backward
passes are executed layer-by-layer, and the output of diffusion models is
produced by applying a sequence of denoising steps. This sequential approach
results in a computational cost proportional to the number of steps involved,
presenting a potential bottleneck as the number of steps increases. In this
work, we introduce DeepPCR, a novel algorithm which parallelizes typically
sequential operations in order to speed up inference and training of neural
networks. DeepPCR is based on interpreting a sequence of $L$ steps as the
solution of a specific system of equations, which we recover using the Parallel
Cyclic Reduction algorithm. This reduces the complexity of computing the
sequential operations from $\mathcal{O}(L)$ to $\mathcal{O}(\log_2L)$, thus
yielding a speedup for large $L$. To verify the theoretical lower complexity of
the algorithm, and to identify regimes for speedup, we test the effectiveness
of DeepPCR in parallelizing the forward and backward pass in multi-layer
perceptrons, and reach speedups of up to $30\times$ for the forward and
$200\times$ for the backward pass. We additionally showcase the flexibility of
DeepPCR by parallelizing training of ResNets with as many as 1024 layers, and
generation in diffusion models, enabling up to $7\times$ faster training and
$11\times$ faster generation, respectively, when compared to the sequential
approach.
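As a rough, hedged illustration of the idea, the sketch below applies Parallel Cyclic Reduction to a scalar linear recurrence $x_l = a_l x_{l-1} + b_l$, which stands in for a generic sequence of $L$ steps; the nonlinear case handled by DeepPCR requires an additional linearization step that is omitted here, and all function and variable names (`sequential_solve`, `pcr_solve`) are illustrative rather than taken from the paper.

```python
# Minimal sketch, not the paper's implementation: a scalar linear recurrence
# x_l = a_l * x_{l-1} + b_l stands in for a generic sequence of L steps.
import numpy as np

def sequential_solve(a, b, x0):
    """Baseline: O(L) sequential evaluation of x_l = a_l * x_{l-1} + b_l."""
    x, out = x0, []
    for a_l, b_l in zip(a, b):
        x = a_l * x + b_l
        out.append(x)
    return np.array(out)

def pcr_solve(a, b, x0):
    """Parallel Cyclic Reduction: O(log2 L) rounds, each fully parallel.

    Equation l initially reads x_l = a_l * x_{l-1} + b_l. Each round substitutes
    the equation `stride` positions earlier, doubling the distance to the known
    value x_0; after ceil(log2 L) rounds every x_l depends on x_0 alone."""
    a, b = a.astype(float), b.astype(float)  # work on copies
    L, stride = len(a), 1
    while stride < L:
        idx = np.arange(stride, L)
        # All updates within a round are independent -> parallelizable on a GPU.
        b[idx] = a[idx] * b[idx - stride] + b[idx]
        a[idx] = a[idx] * a[idx - stride]
        stride *= 2
    return a * x0 + b  # each equation now reads x_l = a_l' * x_0 + b_l'

rng = np.random.default_rng(0)
L = 1024
a, b, x0 = rng.uniform(-0.9, 0.9, size=L), rng.normal(size=L), 1.0
assert np.allclose(sequential_solve(a, b, x0), pcr_solve(a, b, x0))
```

Each round of the while loop updates every equation independently, so with enough parallel workers the wall-clock cost is the number of rounds, $\lceil\log_2 L\rceil$, which is where the $\mathcal{O}(\log_2 L)$ scaling claimed in the abstract comes from.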
Related papers
- PRF: Parallel Resonate and Fire Neuron for Long Sequence Learning in Spiking Neural Networks [6.545474731089018]
We address the efficiency and performance challenges of long sequence learning in Spiking Neural Networks (SNNs) simultaneously.
First, we propose a decoupled reset method for parallel spiking neuron training, reducing the typical Leaky Integrate-and-Fire (LIF) model's training time from $O(L^2)$ to $O(L\log L)$.
Second, to capture long-range dependencies, we propose a Parallel Resonate and Fire (PRF) neuron, which leverages an oscillating membrane potential driven by a resonate mechanism from a differentiable reset function in the complex domain.
arXiv Detail & Related papers (2024-10-04T15:51:56Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample as soon as it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks [11.461878019780597]
Gradient descent can converge slowly in some deep neural networks.
It remains unclear whether a gradient clipping scheme can take advantage of multiple machines to achieve a parallel speedup.
arXiv Detail & Related papers (2022-05-10T16:55:33Z)
- Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences [1.9798034349981162]
We present a novel parallel-in-time training scheme for Gated Recurrent Unit (GRU) networks, based on the multigrid-reduction-in-time (MGRIT) solver.
MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel.
Experimental results on the HMDB51 dataset, where each video is an image sequence, demonstrate that the new parallel training scheme achieves up to a 6.5$\times$ speedup over a serial approach.
arXiv Detail & Related papers (2022-03-07T11:32:44Z)
- LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling [6.549125450209931]
Training model parameters by backpropagation inherently creates feedback loops.
The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training.
arXiv Detail & Related papers (2021-08-14T23:51:00Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds while retaining theoretical guarantees.
Experiments on several benchmark datasets demonstrate the effectiveness of our algorithm and corroborate the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M=3,5,7,9, and 11.
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
arXiv Detail & Related papers (2020-04-12T13:18:25Z)
- Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power (a minimal fixed-point sketch follows this entry).
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
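The last entry above frames feedforward evaluation as solving the nonlinear system $x_l = f_l(x_{l-1})$ for $l = 1, \dots, L$. Below is a minimal, hedged sketch of the Jacobi fixed-point iteration on a toy tanh MLP; the layer shapes, initial guess, and names (`jacobi_feedforward`) are assumptions for illustration, not the authors' code.

```python
# Minimal sketch, assuming a toy tanh MLP; names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
L, width = 8, 16
weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(L)]
layers = [lambda x, W=W: np.tanh(W @ x) for W in weights]    # f_l(x) = tanh(W_l x)
x0 = rng.normal(size=width)

def feedforward(layers, x0):
    """Baseline: strictly sequential, one layer after another (L steps)."""
    states, x = [], x0
    for f in layers:
        x = f(x)
        states.append(x)
    return states

def jacobi_feedforward(layers, x0, max_iters=None):
    """Jacobi fixed-point iteration on x_l = f_l(x_{l-1}), l = 1..L.

    Every layer is updated simultaneously from the previous iterate, so each
    iteration is one parallel step; at most L iterations are needed for the
    iterate to match the sequential result exactly."""
    L = len(layers)
    states = [np.zeros_like(x0) for _ in range(L)]            # initial guess
    for _ in range(max_iters or L):
        inputs = [x0] + states[:-1]                           # x_{l-1} from last iterate
        new_states = [f(x) for f, x in zip(layers, inputs)]   # all layers at once
        converged = all(np.array_equal(n, s) for n, s in zip(new_states, states))
        states = new_states
        if converged:
            break
    return states

assert all(np.allclose(s, j) for s, j in
           zip(feedforward(layers, x0), jacobi_feedforward(layers, x0)))
```

Because every layer is updated simultaneously from the previous iterate, each Jacobi iteration is a single parallel step, and the iterate matches the sequential result after at most $L$ iterations (often fewer in practice, which is where the speedup comes from).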