Related papers: A Unifying View on Implicit Bias in Training Linear Neural Networks

A Unifying View on Implicit Bias in Training Linear Neural Networks

URL: http://arxiv.org/abs/2010.02501v3
Date: Fri, 10 Sep 2021 05:33:27 GMT
Title: A Unifying View on Implicit Bias in Training Linear Neural Networks
Authors: Chulhee Yun, Shankar Krishnan, Hossein Mobahi
Abstract summary: We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases.
Score: 31.65006970108761
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.

Related papers

Low-rank bias, weight decay, and model merging in neural networks [12.352761060862072]
We show several properties of deep neural networks trained with $L2$ regularization. We also investigate a multitask learning phenomenon enabled by $L2$ regularization and low-rank bias.
arXiv Detail & Related papers (2025-02-24T17:17:00Z)
Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected constant depth increasing width. The continuous kernel error norm implies an approximation under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z)
Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights. We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two. For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons. Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent. We show that SGD is biased towards a simple solution. We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks [1.3706331473063877]
We show that the solution of training a width-$n$ shallow ReLU network is within $n- 1/2$ of the function which fits the training data. We also show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
arXiv Detail & Related papers (2020-06-12T17:46:40Z)
Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-layer Networks [70.15611146583068]
We develop exact representations of training two-layer neural networks with rectified linear units (ReLUs) Our theory utilizes semi-infinite duality and minimum norm regularization.
arXiv Detail & Related papers (2020-02-24T21:32:41Z)
How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer [5.969858080492586]
We consider one dimensional (shallow) ReLU neural networks in which weights are chosen randomly and only the terminal layer is trained. We show that for such networks L2-regularized regression corresponds in function space to regularizing the estimate's second derivative for fairly general loss functionals.
arXiv Detail & Related papers (2019-11-07T13:48:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.