Convergence beyond the over-parameterized regime using Rayleigh
quotients
- URL: http://arxiv.org/abs/2301.08117v1
- Date: Thu, 19 Jan 2023 15:18:23 GMT
- Title: Convergence beyond the over-parameterized regime using Rayleigh
quotients
- Authors: David A. R. Robin, Kevin Scaman, Marc Lelarge
- Abstract summary: We show that Rayleigh quotients provide a unified view for several convergence analysis techniques in the literature.
Our strategy produces a proof of convergence for various examples of parametric learning.
- Score: 18.728779959566946
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present a new strategy to prove the convergence of deep
learning architectures to a zero training (or even testing) loss by gradient
flow. Our analysis is centered on the notion of Rayleigh quotients in order to
prove Kurdyka-{\L}ojasiewicz inequalities for a broader set of neural network
architectures and loss functions. We show that Rayleigh quotients provide a
unified view for several convergence analysis techniques in the literature. Our
strategy produces a proof of convergence for various examples of parametric
learning. In particular, our analysis does not require the number of parameters
to tend to infinity, nor the number of samples to be finite, thus extending to
test loss minimization and beyond the over-parameterized regime.
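To make the proof strategy concrete, the display below sketches the standard Łojasiewicz-type argument in generic notation; the symbols, and the simple exponent-one form of the inequality, are illustrative simplifications rather than the paper's own statement.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Under gradient flow $\dot\theta(t) = -\nabla L(\theta(t))$, a
Kurdyka--\L{}ojasiewicz-type inequality
$\|\nabla L(\theta)\|^{2} \ge c\, L(\theta)$ with $c > 0$ gives
\begin{align*}
  \frac{d}{dt} L(\theta(t))
    = -\|\nabla L(\theta(t))\|^{2}
    \le -c\, L(\theta(t))
  \quad\Longrightarrow\quad
  L(\theta(t)) \le L(\theta(0))\, e^{-ct} \xrightarrow[t\to\infty]{} 0 .
\end{align*}
For a composite loss $L(\theta) = \ell(h_\theta)$ with Jacobian
$J = \partial h_\theta / \partial \theta$, the chain rule gives
$\|\nabla L\|^{2} = \nabla\ell^{\top} J J^{\top} \nabla\ell$, so lower
bounding the Rayleigh quotient $v^{\top} J J^{\top} v / \|v\|^{2}$ at
$v = \nabla\ell$ along the trajectory is what certifies the inequality.
\end{document}
```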
Related papers
- Geometric Convergence Analysis of Variational Inference via Bregman Divergences [3.7098038388802252]
Variational Inference (VI) provides a scalable framework for approximate inference by maximizing the Evidence Lower Bound (ELBO).
We establish a novel theoretical framework for analyzing objective convergence by exploiting properties of exponential family distributions.
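As a reminder of the objective being analyzed, here is a minimal Python sketch of the ELBO in closed form for a toy conjugate Gaussian model; the model and all names are illustrative, not taken from the paper.

```python
import numpy as np

def elbo(x, m, s):
    """Evidence Lower BOund for a toy model: mu ~ N(0, 1), x | mu ~ N(mu, 1),
    with variational family q(mu) = N(m, s^2). Illustrative only."""
    exp_loglik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    exp_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s ** 2)
    entropy      = 0.5 * np.log(2 * np.pi * np.e * s ** 2)
    return exp_loglik + exp_logprior + entropy

# ELBO = log p(x) - KL(q || posterior); the exact posterior is N(x/2, 1/2).
x = 1.3
print(elbo(x, m=x / 2, s=np.sqrt(0.5)))   # maximal value
print(elbo(x, m=0.0,   s=1.0))            # strictly smaller
```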
arXiv Detail & Related papers (2025-10-17T11:30:05Z) - Random Sparse Lifts: Construction, Analysis and Convergence of finite sparse networks [17.487761710665968]
We present a framework to define a large class of neural networks for which, by construction, training by gradient flow provably reaches arbitrarily low loss when the number of parameters grows.
arXiv Detail & Related papers (2025-01-10T12:52:00Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
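For readers who want the algorithm itself, a minimal finite-dimensional sketch of simultaneous gradient descent-ascent follows; the paper's mean-field, infinite-width setting is not modeled here, and the toy objective is ours.

```python
import numpy as np

def gda(grad_x, grad_y, x, y, lr=0.05, steps=2000):
    """Simultaneous gradient descent-ascent: x descends f, y ascends f."""
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y

# Toy objective f(x, y) = x^2 - y^2 + 2xy, with saddle point at the origin.
grad_x = lambda x, y: 2 * x + 2 * y    # df/dx
grad_y = lambda x, y: -2 * y + 2 * x   # df/dy
print(gda(grad_x, grad_y, x=1.0, y=-1.0))  # converges near (0.0, 0.0)
```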
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
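The appeal of such a loss is easy to see in the linear case, where the gradient is constant and the flow is available in closed form; the sketch below uses the common definition ell(y, f) = 1 - y*f for a linear model, which is our illustration rather than the paper's general setting.

```python
import numpy as np

def unhinged_gradient_flow(X, y, w0, t):
    """Closed-form gradient flow for the mean unhinged loss 1 - y * <w, x>.

    The gradient is the constant -mean(y_i * x_i), so w(t) is explicit.
    Illustrative sketch; the paper handles richer settings (deep models,
    time-varying learning rates, feature normalization).
    """
    g = -(y[:, None] * X).mean(axis=0)   # constant gradient w.r.t. w
    return w0 - t * g                    # w(t) = w0 - t * g

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
print(unhinged_gradient_flow(X, y, w0=np.zeros(2), t=3.0))  # [2., 0.]
```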
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Curvature-Independent Last-Iterate Convergence for Games on Riemannian
Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate.
To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
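To fix ideas, here is a one-player sketch of Riemannian gradient descent on the unit sphere with a constant, curvature-agnostic step size; the game-theoretic setting and the paper's guarantees are not reproduced, and the toy objective is ours.

```python
import numpy as np

def riemannian_gd_sphere(grad, x0, lr=0.1, steps=500):
    """Riemannian gradient descent on S^{n-1}: project the Euclidean
    gradient onto the tangent space, step, then retract by normalizing."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        g = grad(x)
        g_tan = g - np.dot(g, x) * x    # tangent-space projection
        x = x - lr * g_tan
        x = x / np.linalg.norm(x)       # retraction back onto the sphere
    return x

# Minimize f(x) = x^T A x on the sphere: optimum is A's bottom eigenvector.
A = np.diag([3.0, 2.0, 1.0])
print(riemannian_gd_sphere(lambda x: 2 * A @ x, x0=np.ones(3)))  # ~(0, 0, 1)
```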
arXiv Detail & Related papers (2023-06-29T01:20:44Z) - Critical Points and Convergence Analysis of Generative Deep Linear
Networks Trained with Bures-Wasserstein Loss [2.294014185517203]
We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance.
We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank-bounded matrices.
We establish convergence results for gradient flow using a smooth perturbative version of the loss and convergence results for finite step size gradient descent.
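The loss in question has a closed form that is easy to state; a standalone sketch of the squared Bures-Wasserstein distance between covariance matrices follows (the deep factorization model itself is not reproduced).

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_sq(sigma1, sigma2):
    """Squared Bures-Wasserstein distance between PSD covariance matrices:
    BW^2(S1, S2) = tr(S1) + tr(S2) - 2 tr((S1^{1/2} S2 S1^{1/2})^{1/2})."""
    root1 = sqrtm(sigma1)
    cross = sqrtm(root1 @ sigma2 @ root1)
    return np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(cross).real

s1 = np.array([[2.0, 0.5], [0.5, 1.0]])
s2 = np.eye(2)
print(bures_wasserstein_sq(s1, s2))   # positive
print(bures_wasserstein_sq(s1, s1))   # ~0
```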
arXiv Detail & Related papers (2023-03-06T10:56:14Z) - Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
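The size-one-group phenomenon is easy to reproduce with the classic diagonal reparameterization w = u*u - v*v, whose gradient dynamics from small initialization are known to bias toward sparse interpolants; this sketch illustrates that general effect and is not the paper's parameterization or its minimax analysis.

```python
import numpy as np

def diag_net_gd(X, y, lr=0.01, steps=20000, init=1e-3):
    """Plain gradient descent on the squared loss of X @ (u*u - v*v) - y.
    Small init makes the trajectory hug sparse solutions (implicit bias)."""
    n, d = X.shape
    u = init * np.ones(d)
    v = init * np.ones(d)
    for _ in range(steps):
        g = X.T @ (X @ (u * u - v * v) - y) / n   # gradient w.r.t. w
        u, v = u - 2 * lr * g * u, v + 2 * lr * g * v
    return u * u - v * v

rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[[3, 17]] = [1.0, -2.0]   # 2-sparse ground truth
y = X @ w_true                                        # noiseless labels
print(np.round(diag_net_gd(X, y), 2))  # mass concentrates on coords 3, 17
```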
arXiv Detail & Related papers (2023-01-29T20:54:03Z) - Convergence analysis of unsupervised Legendre-Galerkin neural networks
for linear second-order elliptic PDEs [0.8594140167290099]
We perform the convergence analysis of unsupervised Legendre-Galerkin neural networks (ULGNet).
ULGNet is a deep-learning-based numerical method for solving partial differential equations (PDEs).
arXiv Detail & Related papers (2022-11-16T13:31:03Z) - Annihilation of Spurious Minima in Two-Layer ReLU Networks [9.695960412426672]
We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss.
We show that adding neurons can turn symmetric spurious minima into saddles.
We also prove the existence of descent directions in certain subspaces arising from the symmetry structure of the loss function.
arXiv Detail & Related papers (2022-10-12T11:04:21Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points, where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - Probabilistic methods for approximate archetypal analysis [8.829245587252435]
Archetypal analysis is an unsupervised learning method for exploratory data analysis.
We introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data.
We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.
arXiv Detail & Related papers (2021-08-12T14:27:11Z) - Stochastic Approximation for Online Tensorial Independent Component
Analysis [98.34292831923335]
Independent component analysis (ICA) has been a popular dimension reduction tool in statistical machine learning and signal processing.
In this paper, we present a by-product online tensorial algorithm that estimates each independent component.
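For orientation, the batch fixed-point iteration that such online tensorial schemes descend from is one-unit FastICA on the kurtosis tensor; the sketch below is that batch baseline on whitened data, not the paper's streaming algorithm.

```python
import numpy as np

def fastica_one_unit(X, iters=50, seed=0):
    """Batch kurtosis fixed-point iteration on whitened data X (n x d):
    w <- E[(w.x)^3 x] - 3 w, then renormalize."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        w = ((X @ w) ** 3 @ X) / len(X) - 3.0 * w
        w /= np.linalg.norm(w)
    return w

# Two independent Laplace sources mixed by a rotation (data stays white).
rng = np.random.default_rng(1)
S = rng.laplace(scale=1 / np.sqrt(2), size=(20000, 2))  # unit variance
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(fastica_one_unit(S @ Q.T))  # ~ +/- a column of Q
```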
arXiv Detail & Related papers (2020-12-28T18:52:37Z) - Learning Fast Approximations of Sparse Nonlinear Regression [50.00693981886832]
In this work, we bridge the gap by introducing the Nonlinear Learned Iterative Shrinkage-Thresholding Algorithm (NLISTA).
Experiments on synthetic data corroborate our theoretical results and show our method outperforms state-of-the-art methods.
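For context, the classic (non-learned) ISTA iteration that LISTA-style networks unroll is shown below; nothing here is the paper's architecture.

```python
import numpy as np

def ista(X, y, lam, steps=500):
    """ISTA for the lasso: min_w 0.5 * ||X w - y||^2 + lam * ||w||_1."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = w - lr * X.T @ (X @ w - y)            # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)  # shrink
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100)) / np.sqrt(50)
w_true = np.zeros(100); w_true[[3, 42]] = [1.5, -2.0]
y = X @ w_true
print(np.nonzero(np.round(ista(X, y, lam=0.01), 2))[0])  # support ~ {3, 42}
```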
arXiv Detail & Related papers (2020-10-26T11:31:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.