On the Theory of Implicit Deep Learning: Global Convergence with
Implicit Layers
- URL: http://arxiv.org/abs/2102.07346v2
- Date: Thu, 18 Feb 2021 18:39:14 GMT
- Title: On the Theory of Implicit Deep Learning: Global Convergence with
Implicit Layers
- Authors: Kenji Kawaguchi
- Abstract summary: A deep equilibrium model uses implicit layers, which are implicitly defined through an equilibrium point of an infinite sequence of computation.
We prove a relation between the gradient dynamics of the deep implicit layer and the dynamics of the trust region Newton method of a shallow explicit layer.
- Score: 6.548580592686076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A deep equilibrium model uses implicit layers, which are implicitly defined
through an equilibrium point of an infinite sequence of computation. It avoids
any explicit computation of the infinite sequence by finding an equilibrium
point directly via root-finding and by computing gradients via implicit
differentiation. In this paper, we analyze the gradient dynamics of deep
equilibrium models with nonlinearity only on weight matrices and non-convex
objective functions of weights for regression and classification. Despite
non-convexity, convergence to global optimum at a linear rate is guaranteed
without any assumption on the width of the models, allowing the width to be
smaller than the output dimension and the number of data points. Moreover, we
prove a relation between the gradient dynamics of the deep implicit layer and
the dynamics of the trust region Newton method of a shallow explicit layer. This
mathematically proven relation along with our numerical observation suggests
the importance of understanding implicit bias of implicit layers and an open
problem on the topic. Our proofs deal with implicit layers, weight tying and
nonlinearity on weights, and differ from those in the related literature.
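To make the mechanism described in the abstract concrete, the following is a minimal NumPy sketch, under assumed notation and hyperparameters that are not taken from the paper: the nonlinearity sigma acts only on the weight matrix W, so the equilibrium of z = sigma(W) z + U x is obtained by a single linear solve instead of unrolling the infinite sequence, and the gradient with respect to W follows from implicit differentiation. The targets come from a hypothetical teacher so the regression objective is realizable.

```python
# Minimal NumPy sketch (not the paper's code) of a deep equilibrium layer with
# the nonlinearity applied only to the weight matrix, as in the abstract above.
# Assumed model: equilibrium z = sigma(W) z + U x with readout V z; targets are
# generated by a hypothetical "teacher" W_true so that zero loss is attainable.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, n = 3, 4, 2, 8

U = rng.standard_normal((d_hid, d_in))                    # input injection (fixed)
V = rng.standard_normal((d_out, d_hid)) / np.sqrt(d_hid)  # readout (fixed)
X = rng.standard_normal((d_in, n))                        # inputs, one column per example
sigma = np.tanh                                           # nonlinearity on weights only

def equilibrium(W):
    """Equilibrium of z = sigma(W) z + U x, found by a direct linear solve
    (assumes the spectral radius of sigma(W) stays below 1)."""
    A = sigma(W)
    return np.linalg.solve(np.eye(d_hid) - A, U @ X), A

W_true = 0.15 * rng.standard_normal((d_hid, d_hid))       # hypothetical teacher
Y = V @ equilibrium(W_true)[0]                            # realizable regression targets

def loss_and_grad(W):
    """Squared loss and its gradient in W via implicit differentiation."""
    Z, A = equilibrium(W)
    R = (V @ Z - Y) / n                                   # d(loss)/d(V Z)
    loss = 0.5 * np.sum(R * (V @ Z - Y))
    # dL/dA = (I - A)^{-T} V^T R Z^T, then chain rule through sigma = tanh.
    Lam = np.linalg.solve((np.eye(d_hid) - A).T, V.T @ R)
    return loss, (Lam @ Z.T) * (1.0 - A ** 2)             # tanh'(W) = 1 - tanh(W)^2

# Plain gradient descent on the non-convex objective in W.
W = 0.1 * rng.standard_normal((d_hid, d_hid))             # student initialization
lr, steps = 0.02, 1000
print("initial loss:", loss_and_grad(W)[0])
for _ in range(steps):
    loss, grad = loss_and_grad(W)
    W -= lr * grad
print("final loss:", loss_and_grad(W)[0])
```

The linear solve plays the role of root-finding in this special setting; for a general nonlinear layer map, an iterative equilibrium solver would be needed instead.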
Related papers
- Weak Correlations as the Underlying Principle for Linearization of
Gradient-Based Learning Systems [1.0878040851638]
This paper delves into gradient descent-based learning algorithms that display a linear structure in their parameter dynamics.
We establish that this apparent linearity arises from weak correlations between the first and higher-order derivatives of the hypothesis function.
Exploiting the relationship between linearity and weak correlations, we derive a bound on deviations from linearity observed during the training trajectory of gradient descent.
arXiv Detail & Related papers (2024-01-08T16:44:23Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Gradient is All You Need? [0.0]
In this paper, we provide a novel analytical perspective on the theoretical understanding of gradient-based learning algorithms by interpreting consensus-based optimization (CBO) as a stochastic relaxation of gradient descent.
Our results prove the intrinsic power of CBO to alleviate the complexities of the nonlocal landscape function.
arXiv Detail & Related papers (2023-06-16T11:30:55Z) - Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients due to the non-differentiable discrete function during training.
arXiv Detail & Related papers (2023-02-07T10:51:53Z) - Dynamical chaos in nonlinear Schrödinger models with subquadratic
power nonlinearity [137.6408511310322]
We deal with a class of nonlinear Schrödinger lattices with random potential and subquadratic power nonlinearity.
We show that the spreading process is subdiffusive and has complex microscopic organization.
The limit of quadratic power nonlinearity is also discussed and shown to result in a delocalization border.
arXiv Detail & Related papers (2023-01-20T16:45:36Z) - Global Convergence of Over-parameterized Deep Equilibrium Models [52.65330015267245]
A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection.
Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation.
We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.
arXiv Detail & Related papers (2022-05-27T08:00:13Z) - Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with
Linear Widths [25.237054775800164]
This paper studies the convergence of gradient flow and gradient descent for nonlinear ReLU activated implicit networks.
We prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is linear in the sample size (a fixed-point-iteration sketch of such an implicit layer appears after this list).
arXiv Detail & Related papers (2022-05-16T06:07:56Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
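As referenced in the implicit-network entry above, the following complementary NumPy sketch covers the generic setting treated in the related papers (e.g., over-parameterized DEQs and ReLU-activated implicit networks), where the nonlinearity acts on the pre-activation and no closed-form equilibrium exists: the forward pass runs a fixed-point iteration, and implicit differentiation is realized as an adjoint fixed-point iteration. All names, dimensions, and step sizes are illustrative assumptions, not taken from any of the listed papers.

```python
# NumPy sketch of a generic implicit network Z = relu(W Z + U X): the forward
# pass uses fixed-point iteration, and gradients are obtained by implicit
# differentiation via an adjoint fixed-point iteration. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_out, n = 3, 16, 2, 10

W = 0.05 * rng.standard_normal((d_hid, d_hid))            # implicit-layer weights (trained)
U = rng.standard_normal((d_hid, d_in))                    # input injection (fixed)
V = rng.standard_normal((d_out, d_hid)) / np.sqrt(d_hid)  # readout (trained)
X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, n))

relu = lambda t: np.maximum(t, 0.0)

def forward(W, n_iter=60):
    """Approximate the equilibrium Z = relu(W Z + U X) by fixed-point iteration
    (assumes the layer map stays a contraction, i.e. ||W|| well below 1)."""
    Z = np.zeros((d_hid, n))
    for _ in range(n_iter):
        Z = relu(W @ Z + U @ X)
    return Z

def loss_and_grads(W, V, n_iter=60):
    Z = forward(W, n_iter)
    D = (W @ Z + U @ X > 0).astype(float)   # relu'(pre-activation), per example
    R = (V @ Z - Y) / n
    loss = 0.5 * np.sum(R * (V @ Z - Y))
    G = V.T @ R                             # dL/dZ at the equilibrium
    # Implicit differentiation: column-wise, the adjoint a solves
    # a = g + J^T a with J = diag(relu') W, again by fixed-point iteration.
    A = np.zeros_like(G)
    for _ in range(n_iter):
        A = G + W.T @ (D * A)
    return loss, (D * A) @ Z.T, R @ Z.T     # loss, dL/dW, dL/dV

# Gradient descent; the step size is kept small so W remains a contraction.
lr, steps = 0.05, 400
print("initial loss:", loss_and_grads(W, V)[0])
for _ in range(steps):
    loss, gW, gV = loss_and_grads(W, V)
    W -= lr * gW
    V -= lr * gV
print("final loss:", loss_and_grads(W, V)[0])
```

Training the readout V alongside W also illustrates that only the implicit layer's own parameters need the adjoint solve; gradients for parameters outside the equilibrium map follow from the ordinary chain rule.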