A Unifying View on Implicit Bias in Training Linear Neural Networks
- URL: http://arxiv.org/abs/2010.02501v3
- Date: Fri, 10 Sep 2021 05:33:27 GMT
- Title: A Unifying View on Implicit Bias in Training Linear Neural Networks
- Authors: Chulhee Yun, Shankar Krishnan, Hossein Mobahi
- Abstract summary: We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training.
We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases.
- Score: 31.65006970108761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the implicit bias of gradient flow (i.e., gradient descent with
infinitesimal step size) on linear neural network training. We propose a tensor
formulation of neural networks that includes fully-connected, diagonal, and
convolutional networks as special cases, and investigate the linear version of
the formulation called linear tensor networks. With this formulation, we can
characterize the convergence direction of the network parameters as singular
vectors of a tensor defined by the network. For $L$-layer linear tensor
networks that are orthogonally decomposable, we show that gradient flow on
separable classification finds a stationary point of the $\ell_{2/L}$
max-margin problem in a "transformed" input space defined by the network. For
underdetermined regression, we prove that gradient flow finds a global minimum
which minimizes a norm-like function that interpolates between weighted
$\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems
subsume existing results in the literature while removing standard convergence
assumptions. We also provide experiments that corroborate our analysis.
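As a small illustration of what "implicit bias" means in the underdetermined regression setting, the sketch below covers only the simplest baseline: a plain one-layer linear model trained from zero initialization. It does not implement the paper's tensor formulation or its deeper networks. Gradient descent with a small step size stands in for gradient flow, and the function name, problem sizes, step size, and step count are all illustrative assumptions. In this baseline case, the iterates are known to converge to the minimum $\ell_2$-norm interpolating solution, which the pseudoinverse gives in closed form.

```python
import numpy as np


def gradient_descent_linear_regression(X, y, steps=10000):
    """Run gradient descent on 0.5 * ||X w - y||^2 starting from w = 0.

    Hypothetical helper for illustration; a discretized stand-in for
    gradient flow, not the paper's linear tensor network.
    """
    n, d = X.shape
    w = np.zeros(d)
    # Step size 1 / sigma_max(X)^2 keeps the iteration stable.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


# Underdetermined problem: 5 equations, 20 unknowns (assumed sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
y = rng.standard_normal(5)

w_gd = gradient_descent_linear_regression(X, y)

# Closed-form minimum l2-norm interpolant via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("max |w_gd - w_min_norm| =", np.max(np.abs(w_gd - w_min_norm)))
print("residual norm           =", np.linalg.norm(X @ w_gd - y))
```

Because the iterates start at zero and every update lies in the row space of `X`, the limit is the interpolant with the smallest $\ell_2$ norm, even though infinitely many interpolants exist. Other parameterizations of the same linear predictor can change which interpolant is selected.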
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected, with constant depth and increasing width.
The continuous kernel error norm implies an approximation under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Implicit Bias of Gradient Descent for Mean Squared Error Regression with
Two-Layer Wide Neural Networks [1.3706331473063877]
We show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data.
We also show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
arXiv Detail & Related papers (2020-06-12T17:46:40Z)
- Neural Networks are Convex Regularizers: Exact Polynomial-time Convex
Optimization Formulations for Two-layer Networks [70.15611146583068]
We develop exact representations of training two-layer neural networks with rectified linear units (ReLUs).
Our theory utilizes semi-infinite duality and minimum norm regularization.
arXiv Detail & Related papers (2020-02-24T21:32:41Z)
- How Implicit Regularization of ReLU Neural Networks Characterizes the
Learned Function -- Part I: the 1-D Case of Two Layers with Random First
Layer [5.969858080492586]
We consider one-dimensional (shallow) ReLU neural networks in which weights are chosen randomly and only the terminal layer is trained.
We show that for such networks L2-regularized regression corresponds in function space to regularizing the estimate's second derivative for fairly general loss functionals.
arXiv Detail & Related papers (2019-11-07T13:48:15Z)