Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving
- URL: http://arxiv.org/abs/2002.03629v2
- Date: Fri, 11 Jun 2021 21:44:07 GMT
- Title: Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving
- Authors: Yang Song, Chenlin Meng, Renjie Liao, Stefano Ermon
- Abstract summary: Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
- Score: 106.63673243937492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feedforward computation, such as evaluating a neural network or sampling from
an autoregressive model, is ubiquitous in machine learning. The sequential
nature of feedforward computation, however, requires a strict order of
execution and cannot be easily accelerated with parallel computing. To enable
parallelization, we frame the task of feedforward computation as solving a
system of nonlinear equations. We then propose to find the solution using a
Jacobi or Gauss-Seidel fixed-point iteration method, as well as hybrid methods
of both. Crucially, Jacobi updates operate independently on each equation and
can be executed in parallel. Our method is guaranteed to give exactly the same
values as the original feedforward computation with a reduced (or equal) number
of parallelizable iterations, and hence reduced time given sufficient parallel
computing power. Experimentally, we demonstrate the effectiveness of our
approach in accelerating (i) backpropagation of RNNs, (ii) evaluation of
DenseNets, and (iii) autoregressive sampling of MADE and PixelCNN++, with
speedup factors between 2.1 and 26 under various settings.
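As a concrete illustration of the framing above (feedforward computation as a triangular system of nonlinear equations, solved by Jacobi fixed-point iteration), here is a minimal NumPy sketch. It is not the authors' implementation; the toy chain of functions and all names are illustrative assumptions, and the per-step updates, which are independent within an iteration, are written as a plain Python loop.

```python
import numpy as np

def feedforward_sequential(fs, x0):
    """Standard sequential evaluation: x_i = f_i(x_{i-1})."""
    xs = [x0]
    for f in fs:
        xs.append(f(xs[-1]))
    return xs[1:]

def feedforward_jacobi(fs, x0, max_iters=None):
    """Jacobi-style fixed-point iteration for the same chain.

    Every update x_i <- f_i(x_{i-1}) reads only the previous iterate, so all
    T updates are independent and could run in parallel; after at most
    T = len(fs) iterations the iterate matches the sequential result exactly.
    """
    T = len(fs)
    xs = [np.zeros_like(x0) for _ in range(T)]          # arbitrary initial guess
    max_iters = T if max_iters is None else max_iters
    for _ in range(max_iters):
        prev = [x0] + xs[:-1]                           # x_{i-1} taken from the last iterate
        new = [f(p) for f, p in zip(fs, prev)]          # parallelizable across i
        if all(np.allclose(a, b) for a, b in zip(new, xs)):
            break                                       # reached the fixed point early
        xs = new
    return xs

# Toy usage: a chain of random affine maps followed by tanh.
rng = np.random.default_rng(0)
fs = [(lambda x, W=rng.normal(size=(4, 4)) * 0.3: np.tanh(W @ x)) for _ in range(8)]
x0 = rng.normal(size=4)
assert all(np.allclose(a, b) for a, b in
           zip(feedforward_sequential(fs, x0), feedforward_jacobi(fs, x0)))
```

Gauss-Seidel and hybrid variants reuse values already updated within the current iteration, which propagates information faster at the cost of less parallelism per iteration.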
Related papers
- Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations [53.180374639531145]
Self-Refining Diffusion Samplers (SRDS) retain sample quality and can improve latency at the cost of additional parallel compute.
We take inspiration from the Parareal algorithm, a popular numerical method for parallel-in-time integration of differential equations (a generic Parareal sketch appears after this list).
arXiv Detail & Related papers (2024-12-11T11:08:09Z)
- Closing the Computational-Query Depth Gap in Parallel Stochastic Convex Optimization [26.36906884097317]
We develop a new parallel algorithm for minimizing Lipschitz, convex functions with a subgradient oracle.
Our result closes a gap between the best-known query depth and the best-known computational depth of parallel algorithms.
arXiv Detail & Related papers (2024-06-11T15:41:48Z)
- Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow for inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
- DeepPCR: Parallelizing Sequential Operations in Neural Networks [4.241834259165193]
We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks.
DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm (a simplified PCR sketch appears after this list).
To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
arXiv Detail & Related papers (2023-09-28T10:15:30Z)
- Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
arXiv Detail & Related papers (2023-09-01T18:02:04Z)
- Embarrassingly Parallel Independent Training of Multi-Layer Perceptrons with Heterogeneous Architectures [2.094821665776961]
ParallelMLPs is a procedure for training several independent Multi-Layer Perceptron networks, with different numbers of neurons and different activation functions, in parallel.
We have assessed our algorithm on simulated datasets, varying the number of samples, features, and batches, using 10,000 different models.
We achieved a training speedup of 1 to 4 orders of magnitude compared to the sequential approach.
arXiv Detail & Related papers (2022-06-14T02:00:31Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers and achieved comparable performance to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z)
- CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices [41.57234424773276]
We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs.
We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization.
We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction (a generic compact WY sketch appears after this list).
arXiv Detail & Related papers (2020-04-18T17:58:43Z)
- Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M=3,5,7,9, and 11 (a classic Winograd-style minimal-filtering example appears after this list).
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
arXiv Detail & Related papers (2020-04-12T13:18:25Z)
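Illustrative sketches for the entries flagged above follow; they are generic, hedged examples in plain NumPy, not code from the cited papers.

The Parareal algorithm referenced in the Self-Refining Diffusion Samplers entry alternates a cheap coarse solver and an accurate fine solver over time slices; the fine solves are independent across slices and can run in parallel. The coarse/fine Euler solvers and all parameters below are illustrative choices.

```python
import numpy as np

def parareal(f, y0, t0, t1, n_slices, n_iters, fine_steps=20):
    """Textbook Parareal iteration for dy/dt = f(t, y)."""
    ts = np.linspace(t0, t1, n_slices + 1)

    def euler(y, ta, tb, steps):
        h = (tb - ta) / steps
        t = ta
        for _ in range(steps):
            y, t = y + h * f(t, y), t + h
        return y

    G = lambda y, ta, tb: euler(y, ta, tb, 1)           # coarse, cheap solver
    F = lambda y, ta, tb: euler(y, ta, tb, fine_steps)  # fine, accurate solver

    # Initial guess from a sequential coarse sweep.
    U = [y0]
    for n in range(n_slices):
        U.append(G(U[n], ts[n], ts[n + 1]))

    for _ in range(n_iters):
        # Fine solves from the current iterate: independent across slices.
        F_old = [F(U[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        G_old = [G(U[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        U_new = [y0]
        for n in range(n_slices):
            # Parareal correction: U_{n+1} = G(U_n_new) + F(U_n_old) - G(U_n_old)
            U_new.append(G(U_new[n], ts[n], ts[n + 1]) + F_old[n] - G_old[n])
        U = U_new
    return np.array(U)

# Toy usage: dy/dt = -y with y(0) = 1, so y(2) should be close to exp(-2).
sol = parareal(lambda t, y: -y, 1.0, 0.0, 2.0, n_slices=8, n_iters=4)
print(sol[-1], np.exp(-2.0))
```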
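The DeepPCR entry relies on Parallel Cyclic Reduction. As a self-contained illustration (not the paper's formulation, which covers more general systems), the sketch below applies the same elimination idea to the lower-bidiagonal system x_i = a_i * x_{i-1} + b_i: each step substitutes the equation `shift` positions earlier into every equation at once, so the recurrence is resolved in O(log n) parallel steps.

```python
import numpy as np

def solve_recurrence_pcr(a, b, x0):
    """Solve x_i = a_i * x_{i-1} + b_i (i = 1..n, x_0 given) in O(log n)
    parallel reduction steps.

    Each step eliminates the dependency at distance `shift` for all positions
    simultaneously, doubling the distance of the remaining dependency.
    """
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n = len(a)
    shift = 1
    while shift < n:
        # Coefficients of the equation `shift` positions earlier
        # (identity map (1, 0) where no such equation exists).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a * a_prev, a * b_prev + b
        shift *= 2
    # After the loop, x_i = a_i * x_0 + b_i for every i.
    return a * x0 + b

# Check against the plain sequential recurrence.
rng = np.random.default_rng(0)
a, b = rng.normal(size=10), rng.normal(size=10)
x, xs = 1.0, []
for ai, bi in zip(a, b):
    x = ai * x + bi
    xs.append(x)
assert np.allclose(solve_recurrence_pcr(a, b, 1.0), xs)
```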
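The CWY entry builds on the compact WY representation of a product of Householder reflections, Q = H_1 ... H_k = I - V T V^T, which lets Q be applied with a few dense matrix products. The sketch below constructs T with the standard recursion and verifies it against the explicit product; it is a generic compact WY construction, not the paper's parametrization or code.

```python
import numpy as np

def compact_wy(V):
    """Compact WY form of Q = H_1 H_2 ... H_k, with H_i = I - beta_i v_i v_i^T.

    Returns (V, T) such that Q = I - V T V^T, where T is k x k upper triangular.
    """
    d, k = V.shape
    T = np.zeros((k, k))
    for i in range(k):
        v = V[:, i]
        beta = 2.0 / (v @ v)
        if i > 0:
            # Standard recursion: new column z = -beta * T[:i,:i] @ (V[:, :i]^T v)
            T[:i, i] = -beta * (T[:i, :i] @ (V[:, :i].T @ v))
        T[i, i] = beta
    return V, T

def apply_q(V, T, X):
    """Compute Q @ X = X - V @ (T @ (V^T @ X)) without forming Q explicitly."""
    return X - V @ (T @ (V.T @ X))

# Check against the explicit product of Householder reflections.
rng = np.random.default_rng(0)
d, k = 6, 3
V = rng.normal(size=(d, k))
_, T = compact_wy(V)
Q = np.eye(d)
for i in range(k):
    v = V[:, [i]]
    Q = Q @ (np.eye(d) - 2.0 * (v @ v.T) / float(v.T @ v))
assert np.allclose(np.eye(d) - V @ T @ V.T, Q)
assert np.allclose(apply_q(V, T, np.eye(d)), Q)
```

Applying Q through `apply_q` costs only dense matrix products, which is the property that makes this representation attractive on GPUs and TPUs.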
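The minimal-filtering entry concerns Winograd-style algorithms that reduce the number of multiplications in short FIR filtering. The classic F(2,3) construction below computes two outputs of a 3-tap filter with 4 multiplications instead of 6; the paper's hardware-oriented algorithms for M = 3, 5, 7, 9, 11 may differ in detail.

```python
import numpy as np

def filter2_direct(d, g):
    """Two outputs of a 3-tap FIR filter, direct form: 6 multiplications."""
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

def filter2_winograd(d, g):
    """Winograd F(2,3): the same two outputs with 4 multiplications.

    The transformed filter terms depend only on g and can be precomputed once
    per filter, so only the four products m1..m4 depend on the data.
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2.0
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2.0
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

rng = np.random.default_rng(0)
d, g = rng.normal(size=4), rng.normal(size=3)
assert np.allclose(filter2_direct(d, g), filter2_winograd(d, g))
```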
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.