Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving
- URL: http://arxiv.org/abs/2002.03629v2
- Date: Fri, 11 Jun 2021 21:44:07 GMT
- Title: Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving
- Authors: Yang Song, Chenlin Meng, Renjie Liao, Stefano Ermon
- Abstract summary: Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
- Score: 106.63673243937492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feedforward computation, such as evaluating a neural network or sampling from
an autoregressive model, is ubiquitous in machine learning. The sequential
nature of feedforward computation, however, requires a strict order of
execution and cannot be easily accelerated with parallel computing. To enable
parallelization, we frame the task of feedforward computation as solving a
system of nonlinear equations. We then propose to find the solution using a
Jacobi or Gauss-Seidel fixed-point iteration method, as well as hybrid methods
of both. Crucially, Jacobi updates operate independently on each equation and
can be executed in parallel. Our method is guaranteed to give exactly the same
values as the original feedforward computation with a reduced (or equal) number
of parallelizable iterations, and hence reduced time given sufficient parallel
computing power. Experimentally, we demonstrate the effectiveness of our
approach in accelerating (i) backpropagation of RNNs, (ii) evaluation of
DenseNets, and (iii) autoregressive sampling of MADE and PixelCNN++, with
speedup factors between 2.1 and 26 under various settings.
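To make the fixed-point view concrete, below is a minimal sketch (not the authors' code) of the Jacobi variant only: a sequential recurrence x[t] = f(x[t-1]) is treated as the nonlinear system x[t] - f(x[t-1]) = 0 and solved by parallel sweeps. The transition function `f`, the scalar state, and all names are hypothetical stand-ins for a real autoregressive model or network layer.

```python
import numpy as np

# Jacobi fixed-point iteration for the recurrence x[t] = f(x[t-1]).
# Every x[t] is updated simultaneously from the previous iterate, so one
# sweep is fully parallel across t.  After at most T sweeps the iterate
# equals the exact sequential values; the loop detects this one sweep later.

def f(x):
    # toy nonlinear transition (logistic map in a contracting regime);
    # stands in for a network layer or autoregressive conditional
    return 0.9 * x * (1.0 - x)

def sequential(x0, T):
    xs, x = [], x0
    for _ in range(T):
        x = f(x)
        xs.append(x)
    return np.array(xs)

def jacobi(x0, T, max_sweeps=10_000):
    x = np.zeros(T)                            # arbitrary initialization
    for sweep in range(1, max_sweeps + 1):
        prev = np.concatenate(([x0], x[:-1]))  # x[t-1] for every t, shifted
        x_new = f(prev)                        # all T updates at once
        if np.array_equal(x_new, x):           # fixed point reached
            return x_new, sweep
        x = x_new
    return x, max_sweeps

x0, T = 0.7, 64
x_seq = sequential(x0, T)
x_jac, sweeps = jacobi(x0, T)
print(np.array_equal(x_seq, x_jac), sweeps)    # True, sweeps <= T + 1
```

In practice the Jacobi iterate often converges in far fewer than T sweeps, which is where the speedup comes from; the paper's Gauss-Seidel and hybrid variants are not shown here.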
Related papers
- Closing the Computational-Query Depth Gap in Parallel Stochastic Convex Optimization [26.36906884097317]
We develop a new parallel algorithm for minimizing Lipschitz, convex functions with a subgradient oracle.
Our result closes a gap between the best-known query depth and the best-known computational depth of parallel algorithms.
arXiv Detail & Related papers (2024-06-11T15:41:48Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - DeepPCR: Parallelizing Sequential Operations in Neural Networks [4.241834259165193]
We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks.
DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm.
To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
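As a rough illustration of this system-of-equations view (not the DeepPCR algorithm itself, which uses Parallel Cyclic Reduction), the sketch below solves a linear recurrence x[t] = a[t]*x[t-1] + b[t] in O(log L) parallel rounds using a recursive-doubling prefix scan; all names and parameters are hypothetical.

```python
import numpy as np

# The L sequential steps x[t] = a[t] * x[t-1] + b[t] form a lower-bidiagonal
# linear system.  A recursive-doubling prefix scan composes the affine maps
# x0 -> x[t] in ceil(log2(L)) parallel rounds instead of L sequential steps.
# (A simpler stand-in for cyclic reduction, shown only to convey the idea.)

def sequential(a, b, x0):
    x, out = x0, []
    for at, bt in zip(a, b):
        x = at * x + bt
        out.append(x)
    return np.array(out)

def parallel_scan(a, b, x0):
    # (A[t], B[t]) represents the composed map x0 -> A[t] * x0 + B[t].
    A, B = a.copy(), b.copy()
    d = 1
    while d < len(a):
        # Combine each map with the one d steps earlier, all t in parallel.
        A_prev = np.concatenate((np.ones(d), A[:-d]))
        B_prev = np.concatenate((np.zeros(d), B[:-d]))
        A, B = A * A_prev, A * B_prev + B
        d *= 2
    return A * x0 + B

rng = np.random.default_rng(0)
L = 512
a, b = rng.uniform(0.6, 1.2, L), rng.normal(size=L)
print(np.allclose(sequential(a, b, 2.0), parallel_scan(a, b, 2.0)))  # True
```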
arXiv Detail & Related papers (2023-09-28T10:15:30Z) - Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
arXiv Detail & Related papers (2023-09-01T18:02:04Z) - Embarrassingly Parallel Independent Training of Multi-Layer Perceptrons
with Heterogeneous Architectures [2.094821665776961]
ParallelMLPs is a procedure that enables training several independent Perceptron Neural Networks, with different numbers of neurons and activation functions, in parallel.
We assessed our algorithm on simulated datasets, varying the number of samples, features, and batches, using 10,000 different models.
We achieved a training speedup of 1 to 4 orders of magnitude compared to the sequential approach.
arXiv Detail & Related papers (2022-06-14T02:00:31Z) - Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root.
For the forward propagation, one method uses a Matrix Taylor Polynomial (MTP) and the other uses Matrix Padé Approximants (MPA).
A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
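For context, here is a minimal sketch of the coupled Newton-Schulz ("NS") iteration that this snippet cites as a baseline; it is not the paper's MTP/MPA method, and the normalization, iteration count, and test matrix are illustrative assumptions.

```python
import numpy as np

def newton_schulz_sqrt(A, num_iters=25):
    """Coupled Newton-Schulz iteration (the 'NS iteration' baseline above),
    returning approximations of A^(1/2) and A^(-1/2) for SPD A.
    It uses only matrix products, so it is GPU/batch friendly, unlike an SVD."""
    n = A.shape[0]
    norm = np.linalg.norm(A)              # Frobenius norm; ensures convergence
    Y, Z, I = A / norm, np.eye(n), np.eye(n)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z               # Y -> (A/norm)^(1/2), Z -> (A/norm)^(-1/2)
    return np.sqrt(norm) * Y, Z / np.sqrt(norm)

# Quick check on a random, well-conditioned SPD matrix.
rng = np.random.default_rng(0)
M = rng.normal(size=(8, 8))
A = M @ M.T + 8 * np.eye(8)
sqrtA, inv_sqrtA = newton_schulz_sqrt(A)
print(np.allclose(sqrtA @ sqrtA, A), np.allclose(sqrtA @ inv_sqrtA, np.eye(8)))
```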
arXiv Detail & Related papers (2022-01-29T10:00:35Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers and achieved comparable performance to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - Restructuring, Pruning, and Adjustment of Deep Models for Parallel
Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z) - CWY Parametrization: a Solution for Parallelized Optimization of
Orthogonal and Stiefel Matrices [41.57234424773276]
We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs.
We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization.
We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.
arXiv Detail & Related papers (2020-04-18T17:58:43Z) - Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M=3,5,7,9, and 11.
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
arXiv Detail & Related papers (2020-04-12T13:18:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.