Critical Points and Convergence Analysis of Generative Deep Linear
Networks Trained with Bures-Wasserstein Loss
- URL: http://arxiv.org/abs/2303.03027v2
- Date: Thu, 1 Jun 2023 12:33:27 GMT
- Title: Critical Points and Convergence Analysis of Generative Deep Linear
Networks Trained with Bures-Wasserstein Loss
- Authors: Pierre Bréchet, Katerina Papagiannouli, Jing An, Guido Montúfar
- Abstract summary: We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance.
We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank-bounded matrices.
We establish convergence results for gradient flow using a smooth perturbative version of the loss and convergence results for finite step size gradient descent.
- Score: 2.294014185517203
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We consider a deep matrix factorization model of covariance matrices trained
with the Bures-Wasserstein distance. While recent works have made important
advances in the study of the optimization problem for overparametrized low-rank
matrix approximation, much emphasis has been placed on discriminative settings
and the square loss. In contrast, our model considers another interesting type
of loss and connects with the generative setting. We characterize the critical
points and minimizers of the Bures-Wasserstein distance over the space of
rank-bounded matrices. For low-rank matrices the Hessian of this loss can
theoretically blow up, which creates challenges to analyze convergence of
optimization methods. We establish convergence results for gradient flow using a
smooth perturbative version of the loss and convergence results for finite step
size gradient descent under certain assumptions on the initial weights.
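For concreteness, here is a minimal sketch of the loss (not the authors' code): the squared Bures-Wasserstein distance evaluated on the covariance produced by a small linear factorization. The layer sizes, function names, and random target below are illustrative assumptions.

```python
# Minimal sketch: squared Bures-Wasserstein distance between the covariance
# produced by a deep linear (matrix factorization) model and a target covariance.
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_sq(sigma, target):
    """BW^2(S, T) = tr(S) + tr(T) - 2 tr((T^{1/2} S T^{1/2})^{1/2})."""
    root = sqrtm(target)
    cross = sqrtm(root @ sigma @ root)
    return np.trace(sigma) + np.trace(target) - 2.0 * np.real(np.trace(cross))

def end_to_end_covariance(weights):
    """Covariance Sigma = (W_L ... W_1)(W_L ... W_1)^T; its rank is bounded by the narrowest layer."""
    W = weights[0]
    for Wk in weights[1:]:
        W = Wk @ W
    return W @ W.T

rng = np.random.default_rng(0)
d, hidden = 5, 3                                 # the width-3 bottleneck bounds rank(Sigma) by 3
weights = [0.1 * rng.normal(size=(hidden, d)),   # W_1
           0.1 * rng.normal(size=(d, hidden))]   # W_2
target = np.cov(rng.normal(size=(d, 200)))       # some full-rank PSD target covariance
print(bures_wasserstein_sq(end_to_end_covariance(weights), target))
```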
Related papers
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation (lookahead) as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators; a minimal sketch of the update is given after this entry.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
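The following is an illustrative sketch, not the paper's implementation, of a lookahead-style linear interpolation update: a base optimizer takes a few fast steps, and the slow weights are then interpolated toward the result. The toy objective, step sizes, and function names are assumptions for the demo.

```python
# Illustrative sketch: lookahead-style linear interpolation of iterates.
import numpy as np

def lookahead(grad, theta0, lr=0.05, k=5, alpha=0.5, outer_steps=200):
    slow = np.array(theta0, dtype=float)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                 # inner optimizer: plain gradient descent
            fast -= lr * grad(fast)
        slow += alpha * (fast - slow)      # interpolate slow weights toward the fast weights
    return slow

# toy problem: minimize 0.5 * ||theta||^2, whose gradient is theta itself
print(lookahead(lambda th: th, theta0=np.ones(3)))
```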
- Curvature-Independent Last-Iterate Convergence for Games on Riemannian Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate.
To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
arXiv Detail & Related papers (2023-06-29T01:20:44Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum-trace-of-Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters; a small sketch of this norm follows the entry.
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
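As a small illustration of the quantity involved (not taken from the paper), the Schatten 1-norm of the end-to-end matrix of a deep linear network is just the sum of its singular values; the layer sizes below are arbitrary.

```python
# Illustrative sketch: Schatten 1-norm (nuclear norm) of the end-to-end matrix W_3 W_2 W_1.
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(4, 6)), rng.normal(size=(5, 4)), rng.normal(size=(3, 5))
end_to_end = W3 @ W2 @ W1                                       # the product matrix the network implements
schatten_1 = np.linalg.svd(end_to_end, compute_uv=False).sum()  # sum of singular values
print(schatten_1)
```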
- Implicit Balancing and Regularization: Generalization and Convergence Guarantees for Overparameterized Asymmetric Matrix Sensing [28.77440901439686]
A series of recent papers have begun to generalize this role for non-random Positive Semi-Definite (PSD) matrix sensing problems.
In this paper, we show that the trajectory of gradient descent from small random initialization moves towards solutions that are both globally optimal and generalize well.
arXiv Detail & Related papers (2023-03-24T19:05:52Z)
- On Asymptotic Linear Convergence of Projected Gradient Descent for Constrained Least Squares [22.851500417035947]
This manuscript presents a unified framework for the analysis of projected gradient descent (PGD) in the context of constrained least squares.
We present a recipe for the convergence analysis of PGD and demonstrate it via a beginning-to-end application of the recipe on four fundamental problems; a generic sketch of the PGD iteration follows this entry.
arXiv Detail & Related papers (2021-12-22T09:49:51Z)
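Below is a generic sketch of a PGD iteration for constrained least squares, not the manuscript's recipe or code; the nonnegativity constraint, step size rule, and problem sizes are assumptions for illustration.

```python
# Generic sketch: projected gradient descent for min ||A x - b||^2 subject to x in C,
# with C taken to be the nonnegative orthant purely for illustration.
import numpy as np

def pgd_least_squares(A, b, project, steps=500):
    x = np.zeros(A.shape[1])
    eta = 1.0 / np.linalg.norm(A, 2) ** 2            # 1 / (largest singular value)^2
    for _ in range(steps):
        x = project(x - eta * A.T @ (A @ x - b))     # gradient step, then projection onto C
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
b = A @ np.abs(rng.normal(size=10)) + 0.01 * rng.normal(size=30)
x_hat = pgd_least_squares(A, b, project=lambda z: np.clip(z, 0.0, None))
print(np.linalg.norm(A @ x_hat - b))
```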
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Optimal Rates for Random Order Online Optimization [60.011653053877126]
We study the random order online optimization setting of Garber et al. (2020), where the loss functions may be chosen by an adversary, but are then presented online in a uniformly random order.
We show that the algorithms of Garber et al. (2020) achieve the optimal bounds and significantly improve their stability.
arXiv Detail & Related papers (2021-06-29T09:48:46Z)
- Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction [35.585697639325105]
In this paper we note that the role of small random initialization is not fully understood.
We show that gradient descent from a small random initialization behaves akin to a spectral method, finding solutions that accurately reconstruct the low-rank matrix.
arXiv Detail & Related papers (2021-06-28T22:52:39Z)
- Exact Linear Convergence Rate Analysis for Low-Rank Symmetric Matrix Completion via Gradient Descent [22.851500417035947]
Factorization-based gradient descent is a scalable and efficient algorithm for solving low-rank matrix completion.
We show that gradient descent enjoys fast linear convergence to a global solution of this problem; a generic sketch of the factorized iteration follows this entry.
arXiv Detail & Related papers (2021-02-04T03:41:54Z)
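Below is a generic sketch, not the paper's code, of factorization-based gradient descent for symmetric low-rank matrix completion; the sampling pattern, step size, and problem sizes are illustrative assumptions.

```python
# Generic sketch: fit X X^T to the observed entries of a low-rank symmetric matrix M.
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 2
X_true = rng.normal(size=(n, r))
M = X_true @ X_true.T
upper = np.triu(rng.random((n, n)) < 0.5)
mask = upper | upper.T                          # symmetric set of observed entries

X = 0.1 * rng.normal(size=(n, r))               # small random initialization
eta = 0.25 / np.linalg.norm(M, 2)               # conservative step size from the spectral norm of M
for _ in range(2000):
    residual = mask * (X @ X.T - M)             # error on observed entries only
    X -= eta * 2.0 * residual @ X               # gradient of 0.5 * ||P_Omega(X X^T - M)||_F^2
print(np.linalg.norm(mask * (X @ X.T - M)))     # residual on the observed entries
```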
- Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent [34.0533596121548]
Low-rank matrix estimation is a nonconvex problem that finds numerous applications in signal processing, machine learning and imaging science.
We show that ScaledGD achieves the best of both worlds: it converges linearly at a rate independent of the condition number of the low-rank matrix.
Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices; a generic sketch of the scaled update follows this entry.
arXiv Detail & Related papers (2020-05-18T17:17:16Z)
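Below is a generic sketch of the ScaledGD update, not the authors' implementation: each factor's gradient step is preconditioned by the inverse Gram matrix of the other factor, which is what the paper credits for the condition-number-free rate. The fully observed toy objective and the shrunk initialization are simplifying assumptions (the paper itself uses a spectral initialization).

```python
# Generic sketch: ScaledGD-style updates for a low-rank factorization L R^T ~ M
# on the fully observed loss 0.5 * ||L R^T - M||_F^2.
import numpy as np

rng = np.random.default_rng(0)
n, r = 40, 3
U, _ = np.linalg.qr(rng.normal(size=(n, r)))
V, _ = np.linalg.qr(rng.normal(size=(n, r)))
S = np.diag([100.0, 10.0, 1.0])                      # condition number 100
M = U @ S @ V.T

L, R = 0.3 * U @ np.sqrt(S), 0.3 * V @ np.sqrt(S)    # deliberately shrunk version of the true factors
eta = 0.5
for _ in range(50):
    residual = L @ R.T - M
    L_new = L - eta * residual @ R @ np.linalg.inv(R.T @ R)      # precondition by (R^T R)^{-1}
    R_new = R - eta * residual.T @ L @ np.linalg.inv(L.T @ L)    # precondition by (L^T L)^{-1}
    L, R = L_new, R_new
print(np.linalg.norm(L @ R.T - M) / np.linalg.norm(M))           # relative recovery error
```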
- Semiparametric Nonlinear Bipartite Graph Representation Learning with Provable Guarantees [106.91654068632882]
We consider bipartite graphs and formalize their representation learning problem as a statistical estimation problem of parameters in a semiparametric exponential family distribution.
We show that the proposed objective is strongly convex in a neighborhood around the ground truth, so that a gradient descent-based method achieves linear convergence rate.
Our estimator is robust to any model misspecification within the exponential family, which is validated in extensive experiments.
arXiv Detail & Related papers (2020-03-02T16:40:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.