Unique Properties of Flat Minima in Deep Networks
- URL: http://arxiv.org/abs/2002.04710v2
- Date: Sat, 8 Aug 2020 22:13:17 GMT
- Title: Unique Properties of Flat Minima in Deep Networks
- Authors: Rotem Mulayoff, Tomer Michaeli
- Abstract summary: We characterize the flat minima in linear neural networks trained with a quadratic loss.
Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.
- Score: 44.21198403467404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is well known that (stochastic) gradient descent has an implicit bias
towards flat minima. In deep neural network training, this mechanism serves to
screen out minima. However, the precise effect that this has on the trained
network is not yet fully understood. In this paper, we characterize the flat
minima in linear neural networks trained with a quadratic loss. First, we show
that linear ResNets with zero initialization necessarily converge to the
flattest of all minima. We then prove that these minima correspond to nearly
balanced networks whereby the gain from the input to any intermediate
representation does not change drastically from one layer to the next. Finally,
we show that consecutive layers in flat minima solutions are coupled. That is,
one of the left singular vectors of each weight matrix equals one of the right
singular vectors of the next matrix. This forms a distinct path from input to
output that, as we show, is dedicated to the signal that experiences the
largest gain end-to-end. Experiments indicate that these properties are
characteristic of both linear and nonlinear models trained in practice.
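As a rough illustration of these properties (not the authors' code; the dimensions, learning rate, and data below are arbitrary choices), one can train a small linear ResNet from zero initialization on a quadratic loss and then probe balancedness and the singular-vector coupling numerically, e.g. in PyTorch:

```python
# Illustrative sketch only: train a linear ResNet (layers A_i = I + W_i, with W_i = 0
# at init) by gradient descent on a quadratic loss, then inspect the trained layers.
import torch

torch.manual_seed(0)
d, n, depth = 10, 200, 4
X = torch.randn(n, d)
Y = X @ torch.randn(d, d) * 0.5                     # linear teacher

Ws = [torch.zeros(d, d, requires_grad=True) for _ in range(depth)]
opt = torch.optim.SGD(Ws, lr=0.05)

def forward(X):
    H = X
    for W in Ws:                                    # h -> (I + W) h, in batch form
        H = H @ (torch.eye(d) + W).T
    return H

for _ in range(5000):
    opt.zero_grad()
    loss = ((forward(X) - Y) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    As = [torch.eye(d) + W for W in Ws]
    # Balancedness: the top gain of each layer should be roughly the same.
    print("per-layer top gains:",
          [round(torch.linalg.matrix_norm(A, ord=2).item(), 3) for A in As])
    # Coupling: compare the top left singular vector of layer i with the top right
    # singular vector of layer i+1; values near 1 would indicate the aligned path
    # (carrying the largest end-to-end gain) described in the abstract.
    for i in range(depth - 1):
        U, _, _ = torch.linalg.svd(As[i])
        _, _, Vh_next = torch.linalg.svd(As[i + 1])
        print(f"|<u1_{i}, v1_{i + 1}>| = {abs(U[:, 0] @ Vh_next[0]).item():.3f}")
```

Per-layer gains staying close to each other would reflect the near-balancedness result, and inner products close to 1 would reflect the coupled path along which the largest end-to-end gain travels.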
Related papers
- Deep linear networks for regression are implicitly regularized towards flat minima [4.806579822134391]
Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one.
We show a lower bound on the sharpness of minimizers, which grows linearly with depth.
We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound (a small numeric illustration follows this entry).
arXiv Detail & Related papers (2024-05-22T08:58:51Z)
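To make the depth dependence described in the entry above concrete, the following toy computation (illustrative only, not from the paper; the target gain a and the input variance are arbitrary) evaluates the largest Hessian eigenvalue at global minima of a depth-L scalar linear network w_L * ... * w_1 fitting y = a * x:

```python
# Illustrative sketch only: sharpness (top Hessian eigenvalue) of minimizers of a
# scalar deep linear network under the population squared loss.
import torch

a, sigma2 = 2.0, 1.0                            # target gain and input variance E[x^2]

def loss(w):
    # 0.5 * E[(w_L*...*w_1*x - a*x)^2] = 0.5 * sigma2 * (prod(w) - a)^2
    return 0.5 * sigma2 * (torch.prod(w) - a) ** 2

def sharpness(w):
    H = torch.autograd.functional.hessian(loss, w)
    return torch.linalg.eigvalsh(H)[-1].item()

for L in [2, 3, 4, 6, 8]:
    balanced = torch.full((L,), a ** (1.0 / L))  # all |w_i| equal, product = a
    unbalanced = balanced.clone()
    unbalanced[0] *= 4.0
    unbalanced[1] /= 4.0                         # same product, less balanced
    print(f"depth {L}:  balanced {sharpness(balanced):7.2f}   "
          f"unbalanced {sharpness(unbalanced):7.2f}")
```

In this toy model the Hessian at any global minimum is the rank-one matrix sigma2 * g g^T with g_i = a / w_i, so the flattest minimizers are exactly the balanced ones and their sharpness equals sigma2 * L * a^(2 - 2/L), i.e. it grows at least linearly in the depth L.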
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes a first step toward understanding the inductive bias of solutions that minimize the trace of the Hessian in deep linear networks.
We show that, for any depth greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix (restated schematically below).
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
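Schematically (precise constants, the exact notion of minima, and the RIP conditions are spelled out in the paper), the approximate equivalence stated in the entry above reads:

```latex
% schematic restatement only (depth L > 1, under RIP on the measurements)
\operatorname*{arg\,min}_{\text{global minima}}
  \operatorname{tr}\!\big(\nabla^{2}\mathcal{L}(W_{1},\dots,W_{L})\big)
\;\approx\;
\operatorname*{arg\,min}_{\text{global minima}}
  \big\lVert W_{L} W_{L-1} \cdots W_{1} \big\rVert_{S_{1}}
```

where the Schatten 1-norm (nuclear norm) is taken of the end-to-end matrix W_L ... W_1 and \mathcal{L} is the measurement loss.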
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks [18.71055320062469]
Monotonic linear interpolation (MLI) is a phenomenon that is commonly observed in the training of neural networks: the loss typically decreases monotonically along the line segment between the initial and the final parameters.
We show that the MLI property is not necessarily related to the hardness of the optimization problem.
In particular, we show that linearly interpolating the weights and the biases has very different effects on the final output (a measurement of the interpolation curve is sketched after this entry).
arXiv Detail & Related papers (2022-10-03T15:33:29Z)
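The measurement behind the MLI phenomenon above is easy to reproduce on a toy problem. The sketch below (illustrative only; the architecture, data, and hyperparameters are arbitrary) trains a small MLP and then evaluates the loss along the straight line between its initial and final parameters:

```python
# Illustrative sketch only: loss along the linear path
# theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final for a small MLP.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = torch.randn(512, 1)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
theta_init = copy.deepcopy(model.state_dict())         # parameters at initialization

opt = torch.optim.SGD(model.parameters(), lr=0.05)
for _ in range(2000):                                  # an ordinary training run
    opt.zero_grad()
    ((model(X) - y) ** 2).mean().backward()
    opt.step()
theta_final = copy.deepcopy(model.state_dict())

probe = copy.deepcopy(model)
for alpha in [i / 10 for i in range(11)]:
    blended = {k: (1 - alpha) * theta_init[k] + alpha * theta_final[k]
               for k in theta_init}
    probe.load_state_dict(blended)
    with torch.no_grad():
        print(f"alpha={alpha:.1f}  loss={((probe(X) - y) ** 2).mean().item():.4f}")
```

MLI is the observation that this curve typically decreases monotonically in alpha; the entry above examines plateaus along this curve and how interpolating weights versus biases affects it.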
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it still struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain networks of various widths (a toy weight-sharing layer is sketched after this entry).
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
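As a rough illustration of the weight-sharing idea above (not the paper's implementation; the layer and width multipliers below are invented for this sketch), a slimmable layer can realize sub-networks of different widths by slicing one shared parameter set:

```python
# Illustrative sketch only: a weight-sharing "slimmable" linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.full = nn.Linear(in_features, out_features)   # single shared parameter set

    def forward(self, x, width_mult=1.0):
        # A sub-network reuses the leading rows of the shared weights; no new parameters.
        out = max(1, int(self.full.out_features * width_mult))
        return F.linear(x, self.full.weight[:out], self.full.bias[:out])

layer = SlimmableLinear(16, 32)
x = torch.randn(4, 16)
for m in (1.0, 0.5, 0.25):      # the full network and two weight-sharing sub-networks
    print(m, layer(x, width_mult=m).shape)
```

Pre-training the full network once while also exercising the narrower slices is the idea behind obtaining several differently sized models from a single run.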
- Implicit Regularization Towards Rank Minimization in ReLU Networks [34.41953136999683]
We study the conjectured relationship between the implicit regularization in neural networks and rank minimization.
We focus on nonlinear ReLU networks, providing several new positive and negative results.
arXiv Detail & Related papers (2022-01-30T09:15:44Z)
- Dissecting Supervised Constrastive Learning [24.984074794337157]
Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks.
We show that one can directly optimize the encoder instead to obtain equally (or even more) discriminative representations via a supervised variant of a contrastive objective (one common form of such an objective is recalled after this entry).
arXiv Detail & Related papers (2021-02-17T15:22:38Z)
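For reference, one common form of the supervised contrastive objective mentioned above is the SupCon loss of Khosla et al. (2020), stated here up to normalization conventions:

```latex
% one standard supervised contrastive objective (SupCon, "L_out" form)
\mathcal{L}_{\mathrm{sup}}
  = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp\!\left(z_i \cdot z_p / \tau\right)}
              {\sum_{a \in A(i)} \exp\!\left(z_i \cdot z_a / \tau\right)}
```

where z_i are the normalized encoder outputs, \tau is a temperature, P(i) indexes the other samples sharing the label of anchor i, and A(i) indexes all samples other than i.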
- A case where a spindly two-layer linear network whips any neural network with a fully connected input layer [24.132345589750592]
We show that a sparse input layer is needed to learn sparse targets sample-efficiently with gradient descent.
Surprisingly, the same type of problem can be solved drastically more efficiently by a simple two-layer linear neural network.
arXiv Detail & Related papers (2020-10-16T20:49:58Z)
- Piecewise linear activations substantially shape the loss surfaces of neural networks [95.73230376153872]
This paper shows how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, defined as local minima with higher empirical risk than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z)