Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with
Linear Widths
- URL: http://arxiv.org/abs/2205.07463v1
- Date: Mon, 16 May 2022 06:07:56 GMT
- Title: Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with
Linear Widths
- Authors: Tianxiang Gao, Hongyang Gao
- Abstract summary: This paper studies the convergence of gradient flow and gradient descent for nonlinear ReLU activated implicit networks.
We prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is \textit{linear} in the sample size.
- Score: 25.237054775800164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Implicit deep learning has recently become popular in the machine learning
community since these implicit models can achieve competitive performance with
state-of-the-art deep networks while using significantly less memory and
computational resources. However, our theoretical understanding of when and how
first-order methods such as gradient descent (GD) converge on
\textit{nonlinear} implicit networks is limited. Although this type of problem
has been studied in standard feed-forward networks, the case of implicit models
is still intriguing because implicit networks have \textit{infinitely} many
layers. The corresponding equilibrium equation probably admits no or multiple
solutions during training. This paper studies the convergence of both gradient
flow (GF) and gradient descent for nonlinear ReLU activated implicit networks.
To deal with the well-posedness problem, we introduce a fixed scalar to scale
the weight matrix of the implicit layer and show that there exists a small
enough scaling constant such that the equilibrium equation remains well-posed throughout
training. As a result, we prove that both GF and GD converge to a global
minimum at a linear rate if the width $m$ of the implicit network is
\textit{linear} in the sample size $N$, i.e., $m=\Omega(N)$.
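To make the well-posedness argument concrete, the sketch below (our own illustration, not code accompanying the paper) solves an assumed equilibrium equation of the form $z = \sigma(\gamma A z + Bx)$ by fixed-point iteration, where $\gamma$ is the scaling constant described above; choosing $\gamma$ so that $\gamma\|A\|_2 < 1$ makes the map a contraction, since ReLU is 1-Lipschitz, so the equilibrium exists and is unique.
```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def equilibrium(A, B, x, gamma, n_iter=200, tol=1e-10):
    """Solve z = relu(gamma * A @ z + B @ x) by fixed-point iteration.

    If gamma * ||A||_2 < 1 the map is a contraction (ReLU is 1-Lipschitz),
    so the equilibrium exists, is unique, and the iteration converges.
    """
    z = np.zeros(A.shape[0])
    for _ in range(n_iter):
        z_next = relu(gamma * A @ z + B @ x)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

# Illustrative dimensions: width m and input dimension d are assumptions.
rng = np.random.default_rng(0)
m, d = 64, 10
A = rng.standard_normal((m, m)) / np.sqrt(m)
B = rng.standard_normal((m, d)) / np.sqrt(d)
x = rng.standard_normal(d)

gamma = 0.5 / np.linalg.norm(A, 2)   # keeps gamma * spectral_norm(A) < 1
z_star = equilibrium(A, B, x, gamma)
print("equilibrium residual:", np.linalg.norm(z_star - relu(gamma * A @ z_star + B @ x)))
```
The paper's argument is that a small enough fixed scaling constant preserves this contraction property throughout training; the sketch only illustrates the equilibrium computation at one set of weights.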
Related papers
- The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z) - A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks.
We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition.
Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalizations of convex potential layers.
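For intuition only: a much simpler (and weaker) way to obtain a 1-Lipschitz linear layer is to rescale the weight matrix by its spectral norm. The sketch below shows that baseline idea; it is not the SDP-based SLL construction, and the dimensions are arbitrary.
```python
import numpy as np

def lipschitz_1_linear(W, x):
    """Apply W rescaled so its Lipschitz constant is at most 1.

    Dividing by the spectral norm (largest singular value) is a simple
    baseline for 1-Lipschitz layers; SLL instead derives layers from an
    SDP condition, which this sketch does not implement.
    """
    sigma = np.linalg.norm(W, 2)          # spectral norm of W
    return (W / max(sigma, 1e-12)) @ x

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
x, y = rng.standard_normal(32), rng.standard_normal(32)
# Outputs move at most as far apart as the inputs did.
assert np.linalg.norm(lipschitz_1_linear(W, x) - lipschitz_1_linear(W, y)) <= np.linalg.norm(x - y) + 1e-9
```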
arXiv Detail & Related papers (2023-03-06T14:31:09Z) - Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
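As a rough illustration of the gating idea (one possible reading of it, not the construction from the paper): each layer stays linear in its weights, while an input-dependent gate shared by all processing units in that layer supplies the nonlinearity. The gating function below is an assumption made for the sketch.
```python
import numpy as np

def gated_linear_layer(W, h, gate):
    # One shared, input-dependent gate modulates every unit in the layer,
    # so the layer remains linear in its weights W.
    return gate * (W @ h)

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
u1, u2 = rng.standard_normal(d), rng.standard_normal(d)

g1 = float(u1 @ x > 0)   # assumed gating signal for layer 1
g2 = float(u2 @ x > 0)   # assumed gating signal for layer 2
h1 = gated_linear_layer(W1, x, g1)
h2 = gated_linear_layer(W2, h1, g2)
```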
arXiv Detail & Related papers (2022-10-31T16:21:56Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
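A minimal sketch of how one might inspect such a rank claim numerically: initialize a two-layer leaky ReLU network with a small variance, take one gradient-descent step on the squared loss, and compare the numerical rank of the first-layer weights before and after. The data, labels, dimensions, step size, and rank tolerance below are all illustrative assumptions.
```python
import numpy as np

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)

def numerical_rank(M, tol=1e-2):
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
n, d, m, a, lr = 10, 200, 100, 0.01, 1.0          # samples, input dim, width, leak, step size
X = rng.standard_normal((n, d))                   # high-dimensional data: d >> n
y = np.sign(X[:, 0])                              # assumed labels, for illustration only
W = 1e-4 * rng.standard_normal((m, d))            # small initialization variance
v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer

print("numerical rank at init:", numerical_rank(W))
# One GD step on the squared loss of f(x) = v . leaky_relu(W x).
pre = X @ W.T                                     # (n, m) pre-activations
residual = leaky_relu(pre, a) @ v - y             # (n,)
slope = np.where(pre > 0, 1.0, a)                 # leaky-ReLU derivative
grad_W = ((residual[:, None] * slope) * v).T @ X / n
W -= lr * grad_W
print("numerical rank after one step:", numerical_rank(W))
```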
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
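The weight sharing can be pictured as slicing one set of parameters at several widths; the sketch below is a toy illustration of that idea (the architecture and slicing rule are assumptions, not the paper's training procedure).
```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class SlimmableMLP:
    """Toy weight-sharing network: every sub-network reuses the first k
    hidden units of the full layer, so one parameter set serves several
    widths. Details are illustrative assumptions."""

    def __init__(self, d_in, d_hidden, d_out, rng):
        self.W1 = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
        self.W2 = rng.standard_normal((d_out, d_hidden)) / np.sqrt(d_hidden)

    def forward(self, x, k):
        h = relu(self.W1[:k] @ x)      # only the first k hidden units
        return self.W2[:, :k] @ h      # matching slice of the output layer

rng = np.random.default_rng(0)
net = SlimmableMLP(d_in=32, d_hidden=128, d_out=10, rng=rng)
x = rng.standard_normal(32)
full_out = net.forward(x, k=128)       # the full network
slim_out = net.forward(x, k=32)        # a narrower sub-network, same weights
```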
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
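For a univariate one-hidden-layer ReLU network $f(x) = \sum_i v_i\,\mathrm{ReLU}(w_i x + b_i)$, the effective number of linear regions can be counted directly from the breakpoints at which the slope actually changes. The sketch below is a small self-contained illustration of that count, not code from the paper.
```python
import numpy as np

def effective_linear_regions(w, b, v, tol=1e-12):
    """Count linear regions of f(x) = sum_i v_i * relu(w_i * x + b_i).

    Unit i contributes a breakpoint at x = -b_i / w_i, where the slope of f
    jumps by v_i * |w_i|; the region count is one plus the number of
    distinct breakpoints with a nonzero total jump.
    """
    jumps = {}
    for wi, bi, vi in zip(w, b, v):
        if abs(wi) < tol:
            continue                     # a flat unit adds no breakpoint
        key = round(-bi / wi, 10)        # group (numerically) equal breakpoints
        jumps[key] = jumps.get(key, 0.0) + vi * abs(wi)
    return 1 + sum(1 for j in jumps.values() if abs(j) > tol)

rng = np.random.default_rng(0)
m = 8
w, b, v = rng.standard_normal(m), rng.standard_normal(m), rng.standard_normal(m)
print("effective linear regions:", effective_linear_regions(w, b, v))  # at most m + 1
```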
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Implicit Regularization Towards Rank Minimization in ReLU Networks [34.41953136999683]
We study the conjectured relationship between the implicit regularization in neural networks and rank minimization.
We focus on nonlinear ReLU networks, providing several new positive and negative results.
arXiv Detail & Related papers (2022-01-30T09:15:44Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices, and a related low-rank phenomenon extends to a class of nonlinear ReLU-activated feedforward networks.
This is the first time such a low-rank phenomenon is proven rigorously for nonlinear ReLU-activated feedforward networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - A global convergence theory for deep ReLU implicit networks via
over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with widths quadratic in the sample size and linear in their depth, within time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
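The surrogate-network idea can be pictured as freezing each ReLU's on/off pattern and reusing it as a fixed mask, which makes the forward pass linear in every weight matrix. The sketch below is our own toy illustration of that construction, not the paper's surrogate network.
```python
import numpy as np

def relu_forward(Ws, x):
    """Standard ReLU forward pass; also records the 0/1 activation pattern."""
    h, masks = x, []
    for W in Ws:
        pre = W @ h
        mask = (pre > 0).astype(float)
        masks.append(mask)
        h = mask * pre
    return h, masks

def fixed_pattern_forward(Ws, x, masks):
    """Surrogate pass: reuse a frozen activation pattern, so the output is
    linear in each weight matrix (a toy version of fixing the patterns)."""
    h = x
    for W, mask in zip(Ws, masks):
        h = mask * (W @ h)
    return h

rng = np.random.default_rng(0)
widths = [16, 32, 32, 1]
Ws = [rng.standard_normal((widths[i + 1], widths[i])) / np.sqrt(widths[i])
      for i in range(len(widths) - 1)]
x = rng.standard_normal(widths[0])
out, masks = relu_forward(Ws, x)
# At the weights that generated the pattern, the two passes agree exactly.
assert np.allclose(out, fixed_pattern_forward(Ws, x, masks))
```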
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - A Unifying View on Implicit Bias in Training Linear Neural Networks [31.65006970108761]
We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training.
We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases.
arXiv Detail & Related papers (2020-10-06T06:08:35Z)