CWY Parametrization: a Solution for Parallelized Optimization of
Orthogonal and Stiefel Matrices
- URL: http://arxiv.org/abs/2004.08675v3
- Date: Tue, 16 Feb 2021 13:19:03 GMT
- Title: CWY Parametrization: a Solution for Parallelized Optimization of
Orthogonal and Stiefel Matrices
- Authors: Valerii Likhosherstov, Jared Davis, Krzysztof Choromanski, Adrian
Weller
- Abstract summary: We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs.
We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization.
We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.
- Score: 41.57234424773276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an efficient approach for optimization over orthogonal groups on
highly parallel computation units such as GPUs or TPUs. As in earlier work, we
parametrize an orthogonal matrix as a product of Householder reflections.
However, to overcome low parallelization capabilities of computing Householder
reflections sequentially, we propose employing an accumulation scheme called
the compact WY (or CWY) transform -- a compact parallelization-friendly matrix
representation for the series of Householder reflections. We further develop a
novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization
which has a competitive complexity and, again, yields benefits when computed on
GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a
stationary point of the training objective when coupled with stochastic
gradient descent. We apply our methods to train recurrent neural network
architectures in the tasks of neural machine translation and video prediction.
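To make the accumulation scheme concrete, below is a minimal NumPy sketch of the compact WY identity the abstract refers to: a product of k Householder reflections H_i = I - 2 v_i v_i^T / ||v_i||^2 collapses into the single form Q = I - V T V^T, where V stacks the Householder vectors and T is a small k x k upper-triangular matrix. The recursion for T is the classical Schreiber-Van Loan construction; the function names and verification script are illustrative assumptions, not the authors' implementation of CWY or T-CWY.

```python
# Accumulate k Householder reflections into the compact WY form Q = I - V T V^T
# and check it against the sequential product of dense reflections.
import numpy as np

def householder(v):
    """Dense reference reflection I - 2 v v^T / ||v||^2."""
    v = np.asarray(v, dtype=float)
    return np.eye(v.size) - 2.0 * np.outer(v, v) / np.dot(v, v)

def cwy_upper_triangular(V):
    """Return the k x k upper-triangular T with H_1 H_2 ... H_k = I - V T V^T
    (classical Schreiber-Van Loan compact WY recursion)."""
    _, k = V.shape
    beta = 2.0 / np.sum(V * V, axis=0)   # beta_i = 2 / ||v_i||^2
    T = np.zeros((k, k))
    T[0, 0] = beta[0]
    for j in range(1, k):
        # New column: -beta_j * T_{j-1} V_{j-1}^T v_j, then beta_j on the diagonal.
        T[:j, j] = -beta[j] * (T[:j, :j] @ (V[:, :j].T @ V[:, j]))
        T[j, j] = beta[j]
    return T

rng = np.random.default_rng(0)
d, k = 8, 4
V = rng.standard_normal((d, k))          # unnormalized Householder vectors
T = cwy_upper_triangular(V)
Q_cwy = np.eye(d) - V @ T @ V.T          # accumulated (compact WY) form

Q_seq = np.eye(d)
for j in range(k):                       # sequential reference product
    Q_seq = Q_seq @ householder(V[:, j])

assert np.allclose(Q_cwy, Q_seq)                  # same matrix
assert np.allclose(Q_cwy.T @ Q_cwy, np.eye(d))    # and it is orthogonal
```

The point of this representation, as the abstract states, is that applying I - V T V^T uses a few large matrix-matrix products rather than k dependent rank-1 updates, which is what makes the parametrization friendly to GPUs and TPUs.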
Related papers
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Iterative Reweighted Least Squares Networks With Convergence Guarantees
for Solving Inverse Imaging Problems [12.487990897680422]
We present a novel optimization strategy for image reconstruction tasks under analysis-based image regularization.
We parameterize such regularizers using potential functions that correspond to weighted extensions of the $\ell_p^p$-vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms.
We show that thanks to the convergence guarantees of our proposed minimization strategy, such optimization can be successfully performed with a memory-efficient implicit back-propagation scheme.
arXiv Detail & Related papers (2023-08-10T17:59:46Z) - Givens Coordinate Descent Methods for Rotation Matrix Learning in
Trainable Embedding Indexes [19.716527782586788]
We propose a family of block Givens coordinate descent algorithms to learn rotation matrices.
Compared to the state-of-the-art SVD method, the Givens algorithms are much more parallelizable.
arXiv Detail & Related papers (2022-03-09T22:58:56Z) - High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
arXiv Detail & Related papers (2022-02-25T16:35:26Z) - Parallelized Computation and Backpropagation Under Angle-Parametrized
Orthogonal Matrices [0.0]
We show how an apparently sequential elementary rotation parametrization can be restructured into blocks of commutative operations.
We discuss parametric restrictions of interest to generative modeling and present promising performance results with a prototype GPU implementation.
arXiv Detail & Related papers (2021-05-30T00:47:03Z) - Randomized Block-Diagonal Preconditioning for Parallel Learning [0.0]
We study preconditioned gradient-based optimization methods where the preconditioning matrix has block-diagonal form.
Our main contribution is to demonstrate that the convergence of these methods can significantly be improved by a randomization technique.
arXiv Detail & Related papers (2020-06-24T10:12:36Z) - Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI).
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization (SN).
arXiv Detail & Related papers (2020-04-02T10:14:27Z) - Accelerating Feedforward Computation via Parallel Nonlinear Equation
Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both (see the sketch of the Jacobi variant after this list).
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
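The last entry above frames sequential feedforward evaluation x_i = f_i(x_{i-1}) as a system of nonlinear equations solved by parallelizable fixed-point sweeps. Below is a minimal NumPy sketch of the Jacobi variant only, on toy tanh layers; the names and setup are assumptions for illustration, not that paper's code.

```python
# Jacobi fixed-point view of a feedforward chain: every x_i is updated from the
# previous iterate, so the L function evaluations in each sweep are independent
# (parallelizable). The exact sequential output is recovered after at most L sweeps.
import numpy as np

def jacobi_feedforward(fs, x0, sweeps=None):
    L = len(fs)
    xs = [np.zeros_like(x0) for _ in range(L)]    # initial guesses for x_1..x_L
    for _ in range(sweeps if sweeps is not None else L):
        prev = [x0] + xs[:-1]                     # previous iterate, shifted by one
        xs = [f(p) for f, p in zip(fs, prev)]     # independent updates (Jacobi sweep)
    return xs[-1]

rng = np.random.default_rng(1)
L, d = 5, 3
Ws = [rng.standard_normal((d, d)) for _ in range(L)]
fs = [lambda x, W=W: np.tanh(W @ x) for W in Ws]  # toy tanh "layers"
x0 = rng.standard_normal(d)

x_seq = x0
for f in fs:                                      # sequential reference
    x_seq = f(x_seq)

assert np.allclose(jacobi_feedforward(fs, x0), x_seq)
```

With enough parallel workers each sweep costs roughly one layer evaluation of wall-clock time, which is the sense in which that entry's "reduced time given sufficient parallel computing power" claim applies.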