Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks
- URL: http://arxiv.org/abs/2202.05510v1
- Date: Fri, 11 Feb 2022 08:55:58 GMT
- Title: Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks
- Authors: Sangmin Lee, Byeongsu Sim, Jong Chul Ye
- Abstract summary: We study gradient flow dynamics in the parameter space when training single-neuron ReLU networks.
Specifically, we discover implicit bias in terms of support vectors in ReLU networks, which play a key role in why and how ReLU networks generalize well.
- Score: 45.886537625951256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding implicit bias of gradient descent has been an important goal in
machine learning research. Unfortunately, even for a single-neuron ReLU
network, it recently proved impossible to characterize the implicit
regularization with the square loss by an explicit function of the norm of
model parameters. In order to close the gap between the existing theory and the
intriguing empirical behavior of ReLU networks, here we examine the gradient
flow dynamics in the parameter space when training single-neuron ReLU networks.
Specifically, we discover implicit bias in terms of support vectors in ReLU
networks, which play a key role in why and how ReLU networks generalize well.
Moreover, we analyze gradient flows with respect to the magnitude of the norm
of initialization, and show the impact of the norm in gradient dynamics.
Lastly, under some conditions, we prove that the norm of the learned weight
strictly increases on the gradient flow.
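To make these quantities concrete, the following is a minimal sketch (not the authors' code; the data, step size, and initialization scale are illustrative assumptions): a single-neuron ReLU model trained on synthetic data with the square loss, with small-step gradient descent standing in for gradient flow, while the norm of the weight and the set of samples currently activating the neuron are logged.

```python
# Minimal sketch (not the authors' code): a single-neuron ReLU network
# f(x) = relu(w . x) trained with the square loss.  Small-step gradient
# descent is used as a crude stand-in for gradient flow, and we log the
# norm of w and the number of currently active samples.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
w_teacher = rng.normal(size=d)
y = np.maximum(X @ w_teacher, 0.0)            # realizable ReLU targets

def loss_and_grad(w):
    pre = X @ w
    resid = np.maximum(pre, 0.0) - y
    active = pre > 0                          # samples activating the neuron
    grad = X.T @ (resid * active) / n         # (sub)gradient of the square loss
    return 0.5 * np.mean(resid ** 2), grad, active

init_scale = 1e-2                             # norm of the initialization
w = rng.normal(size=d)
w *= init_scale / np.linalg.norm(w)

eta, steps = 1e-2, 20_000                     # small step size ~ gradient flow
for t in range(steps + 1):
    loss, grad, active = loss_and_grad(w)
    if t % 2_000 == 0:
        print(f"t={t:6d}  loss={loss:.3e}  ||w||={np.linalg.norm(w):.4f}  "
              f"active samples={int(active.sum())}")
    w -= eta * grad
```

Re-running the sketch with different values of init_scale gives a quick empirical feel for how the norm of the initialization shapes the trajectory, in the spirit of the second result above.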
Related papers
- Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNet in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z)
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- A Dynamics Theory of Implicit Regularization in Deep Low-Rank Matrix Factorization [21.64166573203593]
Implicit regularization is an important way to interpret neural networks.
Recent theory has begun to explain implicit regularization through the model of deep matrix factorization (DMF).
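For a concrete picture of the DMF model, here is a minimal sketch (an assumed setup, not taken from the paper): a depth-3 factorization W3 W2 W1 fit by gradient descent to a subset of observed entries of a rank-2 matrix, printing the singular values of the product so the implicitly low-rank solution can be watched as it emerges.

```python
# Minimal sketch of the deep matrix factorization (DMF) model (setup and
# hyperparameters are illustrative assumptions): fit the product
# W3 @ W2 @ W1 to a subset of observed entries of a rank-2 matrix with
# gradient descent and watch the singular values of the product.
import numpy as np

rng = np.random.default_rng(1)
d, r = 20, 2
M = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))   # rank-2 ground truth
mask = rng.random((d, d)) < 0.3                          # observed entries

scale = 0.1                                              # small-ish initialization
W1, W2, W3 = [scale * rng.normal(size=(d, d)) for _ in range(3)]

eta = 0.01
for t in range(30_001):
    P = W3 @ W2 @ W1
    R = mask * (P - M)                        # residual on the observed entries
    if t % 5_000 == 0:
        sv = np.linalg.svd(P, compute_uv=False)
        print(f"t={t:6d}  observed error={np.linalg.norm(R):.3e}  "
              f"top singular values={np.round(sv[:4], 3)}")
    # gradients of 0.5 * ||mask * (W3 @ W2 @ W1 - M)||_F^2 w.r.t. each factor
    G1 = (W3 @ W2).T @ R
    G2 = W3.T @ R @ W1.T
    G3 = R @ (W2 @ W1).T
    W1, W2, W3 = W1 - eta * G1, W2 - eta * G2, W3 - eta * G3
```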
arXiv Detail & Related papers (2022-12-29T02:11:19Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Gradient descent provably escapes saddle points in the training of shallow ReLU networks [6.458742319938318]
We prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements.
Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks, we show that gradient descent escapes most saddle points.
arXiv Detail & Related papers (2022-08-03T14:08:52Z)
- Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs [19.401271427657395]
The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution.
This article presents the gradient flow dynamics of one-hidden-layer ReLU networks for the mean squared error at small initialisation.
arXiv Detail & Related papers (2022-06-02T09:01:25Z)
- Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices, and we extend this result to nonlinear ReLU-activated feedforward networks.
This is the first time a low-rank phenomenon is proven rigorously for such networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
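A minimal sketch of the deep-linear part of this claim (illustrative hyperparameters, not the paper's experiments) is the following: a depth-3 linear network trained with the logistic loss on linearly separable data, monitoring for each square layer the ratio of its second to its first singular value, which the rank-$1$ result predicts should shrink toward zero as training proceeds (slowly, since the loss is only driven to zero asymptotically).

```python
# Minimal sketch (illustrative setup, not the paper's experiments): a
# depth-3 linear network trained with the logistic loss on linearly
# separable data.  We monitor sigma_2 / sigma_1 for each square weight
# matrix; the rank-1 result predicts this ratio should shrink toward 0.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 10
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)                       # linearly separable labels in {-1, +1}

W1 = 0.1 * rng.normal(size=(d, d))
W2 = 0.1 * rng.normal(size=(d, d))
W3 = 0.1 * rng.normal(size=(1, d))

def sv_ratio(W):
    s = np.linalg.svd(W, compute_uv=False)
    return s[1] / s[0]

eta = 0.1
for t in range(50_001):
    h1 = X @ W1.T
    h2 = h1 @ W2.T
    out = (h2 @ W3.T).ravel()
    margin = np.clip(y * out, -500, 500)      # clip only to avoid overflow in exp
    if t % 10_000 == 0:
        loss = np.mean(np.log1p(np.exp(-margin)))
        print(f"t={t:6d}  loss={loss:.3e}  "
              f"sigma2/sigma1: W1={sv_ratio(W1):.4f}  W2={sv_ratio(W2):.4f}")
    p = 1.0 / (1.0 + np.exp(margin))          # sigmoid(-margin)
    d_out = (-(y * p) / n)[:, None]           # dLoss/d(out), shape (n, 1)
    d_h2 = d_out @ W3                         # backprop to layer-2 output, (n, d)
    d_h1 = d_h2 @ W2                          # backprop to layer-1 output, (n, d)
    W3 -= eta * (d_out.T @ h2)
    W2 -= eta * (d_h2.T @ h1)
    W1 -= eta * (d_h1.T @ X)
```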
arXiv Detail & Related papers (2022-01-28T07:31:19Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution: at convergence, the network implements a piecewise linear map of the input.
We also provide empirical evidence that knots may occur at locations distinct from the data points.
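One way to probe this empirically is sketched below (a toy setup with made-up data and hyperparameters, not the paper's experiments): fit a univariate two-layer ReLU network with one-sample SGD, read off each hidden unit's knot at x = -b_i / w_i, and compare those locations with the training inputs.

```python
# Minimal sketch (toy data and hyperparameters are made up): fit a
# univariate two-layer ReLU network f(x) = sum_i a_i * relu(w_i*x + b_i) + c
# with one-sample SGD, then read off the knot of each hidden unit at
# x = -b_i / w_i and compare with the training inputs.
import numpy as np

rng = np.random.default_rng(3)
m = 50                                        # hidden width
x = np.linspace(-1.0, 1.0, 8)                 # training inputs
y = np.sin(2.5 * x)                           # arbitrary smooth targets

w = rng.normal(size=m)
b = 0.5 * rng.normal(size=m)
a = rng.normal(size=m) / np.sqrt(m)
c = 0.0

eta = 0.02
for t in range(60_000):
    i = rng.integers(len(x))                  # one-sample SGD
    pre = w * x[i] + b
    h = np.maximum(pre, 0.0)
    err = a @ h + c - y[i]                    # residual on sample i
    mask = pre > 0                            # ReLU (sub)gradient mask
    grad_a, grad_c = err * h, err
    grad_w, grad_b = err * a * mask * x[i], err * a * mask
    a, c = a - eta * grad_a, c - eta * grad_c
    w, b = w - eta * grad_w, b - eta * grad_b

pred = np.maximum(np.outer(x, w) + b, 0.0) @ a + c
print("final training MSE:", float(np.mean((pred - y) ** 2)))

knots = -b / np.where(np.abs(w) < 1e-12, np.nan, w)
inside = np.sort(knots[(knots > -1.0) & (knots < 1.0)])
print("training inputs:", np.round(x, 3))
print("knots in [-1, 1]:", np.round(inside, 3))
```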
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Shallow Univariate ReLu Networks as Splines: Initialization, Loss Surface, Hessian, & Gradient Flow Dynamics [1.5393457051344297]
We propose reparametrizing ReLU NNs as continuous piecewise linear splines.
We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum.
Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2.
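The reparametrization itself is simple to sketch (this is one standard spline form for a shallow univariate ReLU network; the paper's exact parametrization may differ): each unit a_i relu(w_i x + b_i) contributes a breakpoint at beta_i = -b_i / w_i and a slope jump of a_i |w_i|, while the units with w_i < 0 also contribute an affine part.

```python
# Minimal sketch of the spline view (illustrative, not the paper's code):
# a shallow univariate ReLU network f(x) = sum_i a_i * relu(w_i*x + b_i)
# is a continuous piecewise linear map.  We rewrite it in the canonical
# spline form c + s0*x + sum_i g_i * relu(x - beta_i) and verify equality.
import numpy as np

rng = np.random.default_rng(4)
m = 6
w = rng.normal(size=m)                        # first-layer weights
b = rng.normal(size=m)                        # first-layer biases
a = rng.normal(size=m)                        # output weights

def relu_net(x):
    return np.maximum(np.outer(x, w) + b, 0.0) @ a

# spline parameters
beta = -b / w                                 # breakpoint of unit i
g = a * np.abs(w)                             # slope jump when crossing beta_i
s0 = np.sum((a * w)[w < 0])                   # slope as x -> -infinity
c = np.sum((a * b)[w < 0])                    # intercept of that left-most piece

def spline(x):
    x = np.atleast_1d(x)
    return c + s0 * x + np.maximum(x[:, None] - beta, 0.0) @ g

xs = np.linspace(-3.0, 3.0, 13)
print(np.allclose(relu_net(xs), spline(xs)))  # True: same piecewise linear map
```

Written this way, knots, slopes, and hence the structure of the loss surface can be inspected directly in spline coordinates.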
arXiv Detail & Related papers (2020-08-04T19:19:49Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate the transition between the kernel and rich regimes empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.