Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving
- URL: http://arxiv.org/abs/2002.03629v2
- Date: Fri, 11 Jun 2021 21:44:07 GMT
- Title: Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving
- Authors: Yang Song, Chenlin Meng, Renjie Liao, Stefano Ermon
- Abstract summary: Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
- Score: 106.63673243937492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feedforward computation, such as evaluating a neural network or sampling from
an autoregressive model, is ubiquitous in machine learning. The sequential
nature of feedforward computation, however, requires a strict order of
execution and cannot be easily accelerated with parallel computing. To enable
parallelization, we frame the task of feedforward computation as solving a
system of nonlinear equations. We then propose to find the solution using a
Jacobi or Gauss-Seidel fixed-point iteration method, as well as hybrid methods
of both. Crucially, Jacobi updates operate independently on each equation and
can be executed in parallel. Our method is guaranteed to give exactly the same
values as the original feedforward computation with a reduced (or equal) number
of parallelizable iterations, and hence reduced time given sufficient parallel
computing power. Experimentally, we demonstrate the effectiveness of our
approach in accelerating (i) backpropagation of RNNs, (ii) evaluation of
DenseNets, and (iii) autoregressive sampling of MADE and PixelCNN++, with
speedup factors between 2.1 and 26 under various settings.
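As a concrete illustration of the framing above (feedforward computation as a triangular system of nonlinear equations, solved by Jacobi fixed-point iteration), here is a minimal NumPy sketch. It is not the authors' implementation; the toy chain of functions and all names are illustrative assumptions, and the per-step updates, which are independent within an iteration, are written as a plain Python loop.

```python
import numpy as np

def feedforward_sequential(fs, x0):
    """Standard sequential evaluation: x_i = f_i(x_{i-1})."""
    xs = [x0]
    for f in fs:
        xs.append(f(xs[-1]))
    return xs[1:]

def feedforward_jacobi(fs, x0, max_iters=None):
    """Jacobi-style fixed-point iteration for the same chain.

    Every update x_i <- f_i(x_{i-1}) reads only the previous iterate, so all
    T updates are independent and could run in parallel; after at most
    T = len(fs) iterations the iterate matches the sequential result exactly.
    """
    T = len(fs)
    xs = [np.zeros_like(x0) for _ in range(T)]          # arbitrary initial guess
    max_iters = T if max_iters is None else max_iters
    for _ in range(max_iters):
        prev = [x0] + xs[:-1]                           # x_{i-1} taken from the last iterate
        new = [f(p) for f, p in zip(fs, prev)]          # parallelizable across i
        if all(np.allclose(a, b) for a, b in zip(new, xs)):
            break                                       # reached the fixed point early
        xs = new
    return xs

# Toy usage: a chain of random affine maps followed by tanh.
rng = np.random.default_rng(0)
fs = [(lambda x, W=rng.normal(size=(4, 4)) * 0.3: np.tanh(W @ x)) for _ in range(8)]
x0 = rng.normal(size=4)
assert all(np.allclose(a, b) for a, b in
           zip(feedforward_sequential(fs, x0), feedforward_jacobi(fs, x0)))
```

Gauss-Seidel and hybrid variants reuse values already updated within the current iteration, which propagates information faster at the cost of less parallelism per iteration.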
Related papers
- Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations [53.180374639531145]
Self-Refining Diffusion Samplers (SRDS) retain sample quality and can improve latency at the cost of additional parallel compute.
We take inspiration from the Parareal algorithm, a popular numerical method for parallel-in-time integration of differential equations (a generic Parareal sketch appears after this list).
arXiv Detail & Related papers (2024-12-11T11:08:09Z)
- Closing the Computational-Query Depth Gap in Parallel Stochastic Convex Optimization [26.36906884097317]
We develop a new parallel algorithm for minimizing Lipschitz, convex functions with a subgradient oracle.
Our result closes a gap between the best-known query depth and the best-known computational depth of parallel algorithms.
arXiv Detail & Related papers (2024-06-11T15:41:48Z)
- Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow for inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
- DeepPCR: Parallelizing Sequential Operations in Neural Networks [4.241834259165193]
We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks.
DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm (a simplified PCR sketch appears after this list).
To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
arXiv Detail & Related papers (2023-09-28T10:15:30Z)
- Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
arXiv Detail & Related papers (2023-09-01T18:02:04Z)
- Embarrassingly Parallel Independent Training of Multi-Layer Perceptrons with Heterogeneous Architectures [2.094821665776961]
ParallelMLPs is a procedure for training several independent Multi-Layer Perceptron networks, with different numbers of neurons and different activation functions, in parallel.
We have assessed our algorithm on simulated datasets, varying the number of samples, features, and batches, using 10,000 different models.
We achieved a training speedup of 1 to 4 orders of magnitude compared to the sequential approach.
arXiv Detail & Related papers (2022-06-14T02:00:31Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers and achieved comparable performance to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z)
- CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices [41.57234424773276]
We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs.
We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization.
We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction (a generic compact WY sketch appears after this list).
arXiv Detail & Related papers (2020-04-18T17:58:43Z)
- Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M=3,5,7,9, and 11 (a classic Winograd-style minimal-filtering example appears after this list).
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
arXiv Detail & Related papers (2020-04-12T13:18:25Z)
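Illustrative sketches for the entries flagged above follow; they are generic, hedged examples in plain NumPy, not code from the cited papers.

The Parareal algorithm referenced in the Self-Refining Diffusion Samplers entry alternates a cheap coarse solver and an accurate fine solver over time slices; the fine solves are independent across slices and can run in parallel. The coarse/fine Euler solvers and all parameters below are illustrative choices.

```python
import numpy as np

def parareal(f, y0, t0, t1, n_slices, n_iters, fine_steps=20):
    """Textbook Parareal iteration for dy/dt = f(t, y)."""
    ts = np.linspace(t0, t1, n_slices + 1)

    def euler(y, ta, tb, steps):
        h = (tb - ta) / steps
        t = ta
        for _ in range(steps):
            y, t = y + h * f(t, y), t + h
        return y

    G = lambda y, ta, tb: euler(y, ta, tb, 1)           # coarse, cheap solver
    F = lambda y, ta, tb: euler(y, ta, tb, fine_steps)  # fine, accurate solver

    # Initial guess from a sequential coarse sweep.
    U = [y0]
    for n in range(n_slices):
        U.append(G(U[n], ts[n], ts[n + 1]))

    for _ in range(n_iters):
        # Fine solves from the current iterate: independent across slices.
        F_old = [F(U[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        G_old = [G(U[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        U_new = [y0]
        for n in range(n_slices):
            # Parareal correction: U_{n+1} = G(U_n_new) + F(U_n_old) - G(U_n_old)
            U_new.append(G(U_new[n], ts[n], ts[n + 1]) + F_old[n] - G_old[n])
        U = U_new
    return np.array(U)

# Toy usage: dy/dt = -y with y(0) = 1, so y(2) should be close to exp(-2).
sol = parareal(lambda t, y: -y, 1.0, 0.0, 2.0, n_slices=8, n_iters=4)
print(sol[-1], np.exp(-2.0))
```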
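The DeepPCR entry relies on Parallel Cyclic Reduction. As a self-contained illustration (not the paper's formulation, which covers more general systems), the sketch below applies the same elimination idea to the lower-bidiagonal system x_i = a_i * x_{i-1} + b_i: each step substitutes the equation `shift` positions earlier into every equation at once, so the recurrence is resolved in O(log n) parallel steps.

```python
import numpy as np

def solve_recurrence_pcr(a, b, x0):
    """Solve x_i = a_i * x_{i-1} + b_i (i = 1..n, x_0 given) in O(log n)
    parallel reduction steps.

    Each step eliminates the dependency at distance `shift` for all positions
    simultaneously, doubling the distance of the remaining dependency.
    """
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n = len(a)
    shift = 1
    while shift < n:
        # Coefficients of the equation `shift` positions earlier
        # (identity map (1, 0) where no such equation exists).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a * a_prev, a * b_prev + b
        shift *= 2
    # After the loop, x_i = a_i * x_0 + b_i for every i.
    return a * x0 + b

# Check against the plain sequential recurrence.
rng = np.random.default_rng(0)
a, b = rng.normal(size=10), rng.normal(size=10)
x, xs = 1.0, []
for ai, bi in zip(a, b):
    x = ai * x + bi
    xs.append(x)
assert np.allclose(solve_recurrence_pcr(a, b, 1.0), xs)
```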
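The CWY entry builds on the compact WY representation of a product of Householder reflections, Q = H_1 ... H_k = I - V T V^T, which lets Q be applied with a few dense matrix products. The sketch below constructs T with the standard recursion and verifies it against the explicit product; it is a generic compact WY construction, not the paper's parametrization or code.

```python
import numpy as np

def compact_wy(V):
    """Compact WY form of Q = H_1 H_2 ... H_k, with H_i = I - beta_i v_i v_i^T.

    Returns (V, T) such that Q = I - V T V^T, where T is k x k upper triangular.
    """
    d, k = V.shape
    T = np.zeros((k, k))
    for i in range(k):
        v = V[:, i]
        beta = 2.0 / (v @ v)
        if i > 0:
            # Standard recursion: new column z = -beta * T[:i,:i] @ (V[:, :i]^T v)
            T[:i, i] = -beta * (T[:i, :i] @ (V[:, :i].T @ v))
        T[i, i] = beta
    return V, T

def apply_q(V, T, X):
    """Compute Q @ X = X - V @ (T @ (V^T @ X)) without forming Q explicitly."""
    return X - V @ (T @ (V.T @ X))

# Check against the explicit product of Householder reflections.
rng = np.random.default_rng(0)
d, k = 6, 3
V = rng.normal(size=(d, k))
_, T = compact_wy(V)
Q = np.eye(d)
for i in range(k):
    v = V[:, [i]]
    Q = Q @ (np.eye(d) - 2.0 * (v @ v.T) / float(v.T @ v))
assert np.allclose(np.eye(d) - V @ T @ V.T, Q)
assert np.allclose(apply_q(V, T, np.eye(d)), Q)
```

Applying Q through `apply_q` costs only dense matrix products, which is the property that makes this representation attractive on GPUs and TPUs.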
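The minimal-filtering entry concerns Winograd-style algorithms that reduce the number of multiplications in short FIR filtering. The classic F(2,3) construction below computes two outputs of a 3-tap filter with 4 multiplications instead of 6; the paper's hardware-oriented algorithms for M = 3, 5, 7, 9, 11 may differ in detail.

```python
import numpy as np

def filter2_direct(d, g):
    """Two outputs of a 3-tap FIR filter, direct form: 6 multiplications."""
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

def filter2_winograd(d, g):
    """Winograd F(2,3): the same two outputs with 4 multiplications.

    The transformed filter terms depend only on g and can be precomputed once
    per filter, so only the four products m1..m4 depend on the data.
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2.0
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2.0
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

rng = np.random.default_rng(0)
d, g = rng.normal(size=4), rng.normal(size=3)
assert np.allclose(filter2_direct(d, g), filter2_winograd(d, g))
```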
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.