Implicit Regularization Towards Rank Minimization in ReLU Networks
- URL: http://arxiv.org/abs/2201.12760v1
- Date: Sun, 30 Jan 2022 09:15:44 GMT
- Title: Implicit Regularization Towards Rank Minimization in ReLU Networks
- Authors: Nadav Timor, Gal Vardi, Ohad Shamir
- Abstract summary: We study the conjectured relationship between the implicit regularization in neural networks and rank minimization.
We focus on nonlinear ReLU networks, providing several new positive and negative results.
- Score: 34.41953136999683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the conjectured relationship between the implicit regularization in
neural networks, trained with gradient-based methods, and rank minimization of
their weight matrices. Previously, it was proved that for linear networks (of
depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss
acts as a rank minimization heuristic. However, understanding to what extent
this generalizes to nonlinear networks is an open problem. In this paper, we
focus on nonlinear ReLU networks, providing several new positive and negative
results. On the negative side, we prove (and demonstrate empirically) that,
unlike the linear case, GF on ReLU networks may no longer tend to minimize
ranks, in a rather strong sense (even approximately, for "most" datasets of
size 2). On the positive side, we reveal that ReLU networks of sufficient depth
are provably biased towards low-rank solutions in several reasonable settings.
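To make the notion of rank minimization concrete, here is a minimal sketch (not the paper's experiments) that trains a small ReLU network with full-batch gradient descent on the square loss and then reports the numerical rank of each weight matrix. The two-point dataset, the layer widths, the learning rate, and the rank tolerance are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: train a ReLU network with full-batch gradient descent on the
# square loss, then inspect the numerical rank of each weight matrix.
# Dataset, architecture, and hyperparameters below are illustrative assumptions.
import torch

torch.manual_seed(0)

# Two training points with vector-valued targets, echoing the "datasets of
# size 2" setting discussed in the abstract.
X = torch.randn(2, 10)
Y = torch.randn(2, 5)

model = torch.nn.Sequential(
    torch.nn.Linear(10, 50, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 50, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 5, bias=False),
)

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(20_000):
    opt.zero_grad()
    loss = ((model(X) - Y) ** 2).mean()   # square loss, full-batch GD
    loss.backward()
    opt.step()

# Report singular values and numerical rank of each weight matrix.
for name, p in model.named_parameters():
    svals = torch.linalg.svdvals(p.detach())
    rank = int((svals > 1e-3 * svals[0]).sum())   # relative tolerance (assumed)
    print(f"{name}: numerical rank {rank} / {min(p.shape)}; "
          f"top singular values {svals[:3].tolist()}")
```

Whether the reported ranks end up small depends on depth, initialization, and the data, which is exactly the kind of behavior the positive and negative results above delineate.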
Related papers
- Deep linear networks for regression are implicitly regularized towards flat minima [4.806579822134391]
Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one.
We prove a lower bound on the sharpness of minimizers that grows linearly with depth.
We also show an implicit regularization towards flat minima: the sharpness of the minimizer is at most a constant times this lower bound (a sketch of estimating sharpness numerically appears after this list).
arXiv Detail & Related papers (2024-05-22T08:58:51Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z) - The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness
in ReLU Networks [64.12052498909105]
We study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks.
In two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples.
arXiv Detail & Related papers (2023-03-02T18:14:35Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with
Linear Widths [25.237054775800164]
This paper studies the convergence of gradient flow and gradient descent for nonlinear ReLU activated implicit networks.
We prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is linear in the sample size.
arXiv Detail & Related papers (2022-05-16T06:07:56Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices.
We extend this low-rank phenomenon to a class of nonlinear ReLU-activated feedforward networks; to our knowledge, this is the first time such a result is proven rigorously beyond linear networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity
Bias [34.81794649454105]
Real-life neural networks are initialized from small random values and trained with cross-entropy loss for classification.
Recent results show that gradient descent converges to the "max-margin" solution with zero loss, which presumably generalizes well.
This paper establishes such global optimality for two-layer ReLU nets trained with gradient flow on linearly separable and symmetric data.
arXiv Detail & Related papers (2021-10-26T17:57:57Z)
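As referenced in the flat-minima entry above, sharpness is the largest eigenvalue of the Hessian of the training loss at a minimizer. Below is a minimal sketch of one standard way to estimate it numerically, via power iteration on Hessian-vector products computed with automatic differentiation; the toy model, data, and iteration count are illustrative assumptions, not the setup of that paper.

```python
# Illustrative sketch (not the paper's code): estimate sharpness, i.e. the
# largest eigenvalue of the Hessian of the training loss, by power iteration
# on Hessian-vector products.
import torch

def estimate_sharpness(loss_fn, params, iters=100):
    """Power iteration on Hessian-vector products of loss_fn w.r.t. params."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    v = torch.randn_like(flat_grad)
    v /= v.norm()
    eig = torch.tensor(0.0)
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eig = v @ hv                      # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eig.item()

# Toy usage with an assumed two-layer ReLU model and random data.
torch.manual_seed(0)
X, y = torch.randn(32, 8), torch.randn(32, 1)
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
params = list(model.parameters())
loss_fn = lambda: ((model(X) - y) ** 2).mean()
print("estimated sharpness:", estimate_sharpness(loss_fn, params))
```

Power iteration only needs Hessian-vector products, so it avoids materializing the full Hessian, whose size is quadratic in the number of parameters.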