Unique Properties of Flat Minima in Deep Networks
- URL: http://arxiv.org/abs/2002.04710v2
- Date: Sat, 8 Aug 2020 22:13:17 GMT
- Title: Unique Properties of Flat Minima in Deep Networks
- Authors: Rotem Mulayoff, Tomer Michaeli
- Abstract summary: We characterize the flat minima in linear neural networks trained with a quadratic loss.
Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.
- Score: 44.21198403467404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is well known that (stochastic) gradient descent has an implicit bias
towards flat minima. In deep neural network training, this mechanism serves to
screen out minima. However, the precise effect that this has on the trained
network is not yet fully understood. In this paper, we characterize the flat
minima in linear neural networks trained with a quadratic loss. First, we show
that linear ResNets with zero initialization necessarily converge to the
flattest of all minima. We then prove that these minima correspond to nearly
balanced networks whereby the gain from the input to any intermediate
representation does not change drastically from one layer to the next. Finally,
we show that consecutive layers in flat minima solutions are coupled. That is,
one of the left singular vectors of each weight matrix equals one of the right
singular vectors of the next matrix. This forms a distinct path from input to
output that, as we show, is dedicated to the signal that experiences the
largest gain end-to-end. Experiments indicate that these properties are
characteristic of both linear and nonlinear models trained in practice.
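As a rough illustration of these properties (not the authors' code; the dimensions, learning rate, and data below are arbitrary choices), one can train a small linear ResNet from zero initialization on a quadratic loss and then probe balancedness and the singular-vector coupling numerically, e.g. in PyTorch:

```python
# Illustrative sketch only: train a linear ResNet (layers A_i = I + W_i, with W_i = 0
# at init) by gradient descent on a quadratic loss, then inspect the trained layers.
import torch

torch.manual_seed(0)
d, n, depth = 10, 200, 4
X = torch.randn(n, d)
Y = X @ torch.randn(d, d) * 0.5                     # linear teacher

Ws = [torch.zeros(d, d, requires_grad=True) for _ in range(depth)]
opt = torch.optim.SGD(Ws, lr=0.05)

def forward(X):
    H = X
    for W in Ws:                                    # h -> (I + W) h, in batch form
        H = H @ (torch.eye(d) + W).T
    return H

for _ in range(5000):
    opt.zero_grad()
    loss = ((forward(X) - Y) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    As = [torch.eye(d) + W for W in Ws]
    # Balancedness: the top gain of each layer should be roughly the same.
    print("per-layer top gains:",
          [round(torch.linalg.matrix_norm(A, ord=2).item(), 3) for A in As])
    # Coupling: compare the top left singular vector of layer i with the top right
    # singular vector of layer i+1; values near 1 would indicate the aligned path
    # (carrying the largest end-to-end gain) described in the abstract.
    for i in range(depth - 1):
        U, _, _ = torch.linalg.svd(As[i])
        _, _, Vh_next = torch.linalg.svd(As[i + 1])
        print(f"|<u1_{i}, v1_{i + 1}>| = {abs(U[:, 0] @ Vh_next[0]).item():.3f}")
```

Per-layer gains staying close to each other would reflect the near-balancedness result, and inner products close to 1 would reflect the coupled path along which the largest end-to-end gain travels.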
Related papers
- Deep linear networks for regression are implicitly regularized towards flat minima [4.806579822134391]
Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one.
We show a lower bound on the sharpness of minimizers, which grows linearly with depth.
We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound (a small numeric illustration follows this entry).
arXiv Detail & Related papers (2024-05-22T08:58:51Z)
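To make the depth dependence described in the entry above concrete, the following toy computation (illustrative only, not from the paper; the target gain a and the input variance are arbitrary) evaluates the largest Hessian eigenvalue at global minima of a depth-L scalar linear network w_L * ... * w_1 fitting y = a * x:

```python
# Illustrative sketch only: sharpness (top Hessian eigenvalue) of minimizers of a
# scalar deep linear network under the population squared loss.
import torch

a, sigma2 = 2.0, 1.0                            # target gain and input variance E[x^2]

def loss(w):
    # 0.5 * E[(w_L*...*w_1*x - a*x)^2] = 0.5 * sigma2 * (prod(w) - a)^2
    return 0.5 * sigma2 * (torch.prod(w) - a) ** 2

def sharpness(w):
    H = torch.autograd.functional.hessian(loss, w)
    return torch.linalg.eigvalsh(H)[-1].item()

for L in [2, 3, 4, 6, 8]:
    balanced = torch.full((L,), a ** (1.0 / L))  # all |w_i| equal, product = a
    unbalanced = balanced.clone()
    unbalanced[0] *= 4.0
    unbalanced[1] /= 4.0                         # same product, less balanced
    print(f"depth {L}:  balanced {sharpness(balanced):7.2f}   "
          f"unbalanced {sharpness(unbalanced):7.2f}")
```

In this toy model the Hessian at any global minimum is the rank-one matrix sigma2 * g g^T with g_i = a / w_i, so the flattest minimizers are exactly the balanced ones and their sharpness equals sigma2 * L * a^(2 - 2/L), i.e. it grows at least linearly in the depth L.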
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes a first step toward understanding the inductive bias of solutions that minimize the trace of the Hessian in deep linear networks.
We show that, for any depth greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix (restated schematically below).
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
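Schematically (precise constants, the exact notion of minima, and the RIP conditions are spelled out in the paper), the approximate equivalence stated in the entry above reads:

```latex
% schematic restatement only (depth L > 1, under RIP on the measurements)
\operatorname*{arg\,min}_{\text{global minima}}
  \operatorname{tr}\!\big(\nabla^{2}\mathcal{L}(W_{1},\dots,W_{L})\big)
\;\approx\;
\operatorname*{arg\,min}_{\text{global minima}}
  \big\lVert W_{L} W_{L-1} \cdots W_{1} \big\rVert_{S_{1}}
```

where the Schatten 1-norm (nuclear norm) is taken of the end-to-end matrix W_L ... W_1 and \mathcal{L} is the measurement loss.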
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks [18.71055320062469]
Monotonic linear interpolation (MLI) is a phenomenon that is commonly observed in the training of neural networks: the loss typically decreases monotonically along the line segment between the initial and the final parameters.
We show that the MLI property is not necessarily related to the hardness of the optimization problem.
In particular, we show that linearly interpolating the weights and the biases has very different effects on the final output (a measurement of the interpolation curve is sketched after this entry).
arXiv Detail & Related papers (2022-10-03T15:33:29Z)
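The measurement behind the MLI phenomenon above is easy to reproduce on a toy problem. The sketch below (illustrative only; the architecture, data, and hyperparameters are arbitrary) trains a small MLP and then evaluates the loss along the straight line between its initial and final parameters:

```python
# Illustrative sketch only: loss along the linear path
# theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final for a small MLP.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = torch.randn(512, 1)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
theta_init = copy.deepcopy(model.state_dict())         # parameters at initialization

opt = torch.optim.SGD(model.parameters(), lr=0.05)
for _ in range(2000):                                  # an ordinary training run
    opt.zero_grad()
    ((model(X) - y) ** 2).mean().backward()
    opt.step()
theta_final = copy.deepcopy(model.state_dict())

probe = copy.deepcopy(model)
for alpha in [i / 10 for i in range(11)]:
    blended = {k: (1 - alpha) * theta_init[k] + alpha * theta_final[k]
               for k in theta_init}
    probe.load_state_dict(blended)
    with torch.no_grad():
        print(f"alpha={alpha:.1f}  loss={((probe(X) - y) ** 2).mean().item():.4f}")
```

MLI is the observation that this curve typically decreases monotonically in alpha; the entry above examines plateaus along this curve and how interpolating weights versus biases affects it.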
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it still struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain networks of various widths (a toy weight-sharing layer is sketched after this entry).
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
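As a rough illustration of the weight-sharing idea above (not the paper's implementation; the layer and width multipliers below are invented for this sketch), a slimmable layer can realize sub-networks of different widths by slicing one shared parameter set:

```python
# Illustrative sketch only: a weight-sharing "slimmable" linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.full = nn.Linear(in_features, out_features)   # single shared parameter set

    def forward(self, x, width_mult=1.0):
        # A sub-network reuses the leading rows of the shared weights; no new parameters.
        out = max(1, int(self.full.out_features * width_mult))
        return F.linear(x, self.full.weight[:out], self.full.bias[:out])

layer = SlimmableLinear(16, 32)
x = torch.randn(4, 16)
for m in (1.0, 0.5, 0.25):      # the full network and two weight-sharing sub-networks
    print(m, layer(x, width_mult=m).shape)
```

Pre-training the full network once while also exercising the narrower slices is the idea behind obtaining several differently sized models from a single run.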
- Implicit Regularization Towards Rank Minimization in ReLU Networks [34.41953136999683]
We study the conjectured relationship between the implicit regularization in neural networks and rank minimization.
We focus on nonlinear ReLU networks, providing several new positive and negative results.
arXiv Detail & Related papers (2022-01-30T09:15:44Z)
- Dissecting Supervised Constrastive Learning [24.984074794337157]
Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks.
We show that one can directly optimize the encoder instead to obtain equally (or even more) discriminative representations via a supervised variant of a contrastive objective (one common form of such an objective is recalled after this entry).
arXiv Detail & Related papers (2021-02-17T15:22:38Z)
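For reference, one common form of the supervised contrastive objective mentioned above is the SupCon loss of Khosla et al. (2020), stated here up to normalization conventions:

```latex
% one standard supervised contrastive objective (SupCon, "L_out" form)
\mathcal{L}_{\mathrm{sup}}
  = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp\!\left(z_i \cdot z_p / \tau\right)}
              {\sum_{a \in A(i)} \exp\!\left(z_i \cdot z_a / \tau\right)}
```

where z_i are the normalized encoder outputs, \tau is a temperature, P(i) indexes the other samples sharing the label of anchor i, and A(i) indexes all samples other than i.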
- A case where a spindly two-layer linear network whips any neural network with a fully connected input layer [24.132345589750592]
We show that a sparse input layer is needed to learn sparse targets sample-efficiently with gradient descent.
Surprisingly, the same type of problem can be solved drastically more efficiently by a simple two-layer linear neural network.
arXiv Detail & Related papers (2020-10-16T20:49:58Z)
- Piecewise linear activations substantially shape the loss surfaces of neural networks [95.73230376153872]
This paper shows how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, defined as local minima with higher empirical risk than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z)