Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss
Landscape for Deep Networks
- URL: http://arxiv.org/abs/2210.01019v1
- Date: Mon, 3 Oct 2022 15:33:29 GMT
- Title: Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss
Landscape for Deep Networks
- Authors: Xiang Wang, Annie N. Wang, Mo Zhou, Rong Ge
- Abstract summary: Monotonic linear interpolation (MLI) is a phenomenon that is commonly observed in the training of neural networks.
We show that the MLI property is not necessarily related to the hardness of optimization problems.
In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output.
- Score: 18.71055320062469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monotonic linear interpolation (MLI) - on the line connecting a random
initialization with the minimizer it converges to, the loss and accuracy are
monotonic - is a phenomenon that is commonly observed in the training of neural
networks. Such a phenomenon may seem to suggest that optimization of neural
networks is easy. In this paper, we show that the MLI property is not
necessarily related to the hardness of optimization problems, and empirical
observations on MLI for deep neural networks depend heavily on biases. In
particular, we show that interpolating both weights and biases linearly leads
to very different influences on the final output, and when different classes
have different last-layer biases on a deep network, there will be a long
plateau in both the loss and accuracy interpolation (which existing theory of
MLI cannot explain). We also show how the last-layer biases for different
classes can be different even on a perfectly balanced dataset using a simple
model. Empirically we demonstrate that similar intuitions hold on practical
networks and realistic datasets.
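The MLI setup described in the abstract can be sketched in a few lines: evaluate the loss along the straight line theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final and check whether it decreases monotonically. The sketch below is a minimal illustration on a hypothetical toy problem (logistic regression on synthetic linearly separable data, with a short gradient-descent run standing in for the trained minimizer); it is not the paper's experimental setup, and all names and hyperparameters are assumptions.

```python
import numpy as np

# Hypothetical toy setup: logistic regression on synthetic, linearly
# separable two-class data (a stand-in for a real network and dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

def loss(w, b):
    """Mean logistic (cross-entropy) loss for weights w and bias b."""
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    eps = 1e-12  # numerical guard for log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# "Random initialization" and the "minimizer it converges to"
# (here approximated by a few hundred gradient-descent steps).
w0, b0 = rng.normal(size=5), 0.0
w1, b1 = w0.copy(), b0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w1 + b1)))
    g = X.T @ (p - y) / len(y)
    w1 -= 0.5 * g
    b1 -= 0.5 * np.mean(p - y)

# Evaluate the loss along theta(alpha) = (1 - alpha)*init + alpha*final,
# interpolating weights and bias jointly, as in the MLI phenomenon.
alphas = np.linspace(0.0, 1.0, 21)
losses = [loss((1 - a) * w0 + a * w1, (1 - a) * b0 + a * b1) for a in alphas]

monotone = all(l1 >= l2 - 1e-9 for l1, l2 in zip(losses, losses[1:]))
print("loss at init:", losses[0], "loss at minimizer:", losses[-1])
print("monotone along the path:", monotone)
```

On this convex toy problem the path is monotone almost by construction; the paper's point is that for deep networks the analogous curve can instead show a long plateau when different classes carry different last-layer biases, so the bias terms (b0, b1 above) deserve separate attention when interpolating.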
Related papers
- Simplicity bias and optimization threshold in two-layer ReLU networks [24.43739371803548]
We show that despite overparametrization, networks converge toward simpler solutions rather than interpolating the training data.
Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions.
arXiv Detail & Related papers (2024-10-03T09:58:57Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices.
This is the first time a low-rank phenomenon is proven rigorously for nonlinear ReLU-activated feedforward networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Analyzing Monotonic Linear Interpolation in Neural Network Loss
Landscapes [17.222244907679997]
We provide sufficient conditions for the MLI property under mean squared error.
While the MLI property holds under various settings, we construct problems in practice that systematically violate the MLI property.
arXiv Detail & Related papers (2021-04-22T13:22:12Z) - The Low-Rank Simplicity Bias in Deep Networks [46.79964271742486]
We make a series of empirical observations that investigate and extend the hypothesis that deep networks are inductively biased to find solutions with lower effective rank embeddings.
We show that our claim holds true on finite width linear and non-linear models on practical learning paradigms and show that on natural data, these are often the solutions that generalize well.
arXiv Detail & Related papers (2021-03-18T17:58:02Z) - Over-parametrized neural networks as under-determined linear systems [31.69089186688224]
We show that it is unsurprising that simple neural networks can achieve zero training loss.
We show that kernels typically associated with the ReLU activation function have fundamental flaws.
We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points.
arXiv Detail & Related papers (2020-10-29T21:43:00Z) - Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z) - Piecewise linear activations substantially shape the loss surfaces of
neural networks [95.73230376153872]
This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, defined as local minima with higher empirical risk than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.