The Low-Rank Simplicity Bias in Deep Networks
- URL: http://arxiv.org/abs/2103.10427v4
- Date: Thu, 23 Mar 2023 14:21:02 GMT
- Title: The Low-Rank Simplicity Bias in Deep Networks
- Authors: Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit
Agrawal, Phillip Isola
- Abstract summary: We make a series of empirical observations that investigate and extend the hypothesis that deep networks are inductively biased to find solutions with lower effective rank embeddings.
We show that our claim holds for finite-width linear and non-linear models under practical learning paradigms, and that on natural data these are often the solutions that generalize well.
- Score: 46.79964271742486
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modern deep neural networks are highly over-parameterized compared to the
data on which they are trained, yet they often generalize remarkably well. A
flurry of recent work has asked: why do deep networks not overfit to their
training data? In this work, we make a series of empirical observations that
investigate and extend the hypothesis that deeper networks are inductively
biased to find solutions with lower effective rank embeddings. We conjecture
that this bias exists because the volume of functions that map to low
effective rank embeddings increases with depth. We show empirically that our
claim holds for finite-width linear and non-linear models under practical
learning paradigms, and that on natural data these are often the solutions
that generalize well. We then show that the simplicity bias exists both at
initialization and after training, and is robust to the choice of
hyper-parameters and learning methods. We further demonstrate how linear over-parameterization of
deep non-linear models can be used to induce low-rank bias, improving
generalization performance on CIFAR and ImageNet without changing the modeling
capacity.
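The two ideas named in the abstract, effective rank and linear over-parameterization, can be made concrete with a short sketch. The snippet below is illustrative only and is not the authors' released code: it assumes the standard entropy-based definition of effective rank (the exponential of the entropy of the normalized singular values) and expands a single linear layer into a product of linear maps, which leaves the representable function class unchanged; the class name `LinearOverparam` is a hypothetical label, not from the paper's codebase.
```python
# Minimal sketch (not the authors' code): entropy-based effective rank and a
# linearly over-parameterized layer.
import torch
import torch.nn as nn


def effective_rank(embeddings: torch.Tensor, eps: float = 1e-12) -> float:
    """exp(entropy of the normalized singular values) of an (N, d) matrix."""
    s = torch.linalg.svdvals(embeddings.detach())
    p = s / (s.sum() + eps)
    return torch.exp(-(p * torch.log(p + eps)).sum()).item()


class LinearOverparam(nn.Module):
    """A d_in -> d_out linear map factored into `depth` linear layers.
    The composition is still a single linear map, so modeling capacity is
    unchanged; only the parameterization is deeper."""

    def __init__(self, d_in: int, d_out: int, depth: int = 3):
        super().__init__()
        dims = [d_in] + [d_out] * depth
        layers = [nn.Linear(dims[i], dims[i + 1], bias=(i == depth - 1))
                  for i in range(depth)]
        self.factors = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.factors(x)


if __name__ == "__main__":
    x = torch.randn(256, 64)
    layer = LinearOverparam(64, 64, depth=3)
    print("effective rank of embeddings:", effective_rank(layer(x)))
```
The sketch only shows the mechanism: a factored linear map can be dropped in place of a single layer of a deep non-linear model, which is the kind of substitution the abstract refers to when reporting gains on CIFAR and ImageNet.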
Related papers
- More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory [12.689249854199982]
We show that the test risk of random feature (RF) regression decreases monotonically with both the number of features and the number of samples.
We then demonstrate that, for a large class of tasks characterized by power-law eigenstructure, training to near-zero training loss is obligatory.
arXiv Detail & Related papers (2023-11-24T18:27:41Z)
- Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs [9.010643838773477]
We introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data.
We show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated.
We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement without any additional cost in training time.
arXiv Detail & Related papers (2023-10-17T09:22:22Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks [18.71055320062469]
Monotonic linear interpolation (MLI) is a phenomenon commonly observed in the training of neural networks.
We show that the MLI property is not necessarily related to the hardness of optimization problems.
In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output.
arXiv Detail & Related papers (2022-10-03T15:33:29Z)
- Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery [33.74925020397343]
Deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters.
We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization.
We show that ReLU networks learn simple and sparse models even when the labels are noisy.
arXiv Detail & Related papers (2022-09-30T06:47:15Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
- Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit Bias towards Low Rank [1.9350867959464846]
In deep learning, gradient descent tends to prefer solutions which generalize well.
In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem.
arXiv Detail & Related papers (2020-11-27T15:08:34Z)
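Relating to the last entry above, and to the main paper's thesis, the implicit low-rank bias of gradient descent on deep linear factorizations can be demonstrated in a few lines. The following is a hedged illustration, not code from any of the listed papers: it fits a rank-3 target with a depth-3 matrix factorization from a small random initialization and reports the effective rank of the learned product; the dimension, depth, learning rate, and step count are arbitrary choices and may need tuning.
```python
# Illustrative only: gradient descent on a depth-3 linear factorization
# W3 @ W2 @ W1 fitted to a rank-3 target. With a small initialization the
# learned product tends to have low effective rank.
import torch

torch.manual_seed(0)
d, true_rank = 32, 3
target = torch.randn(d, true_rank) @ torch.randn(true_rank, d)  # rank-3 target

# three square factor matrices with a small random initialization
factors = [(0.1 * torch.randn(d, d)).requires_grad_() for _ in range(3)]
opt = torch.optim.SGD(factors, lr=0.005)


def effective_rank(m: torch.Tensor, eps: float = 1e-12) -> float:
    # same entropy-based effective rank as in the earlier sketch
    s = torch.linalg.svdvals(m.detach())
    p = s / (s.sum() + eps)
    return torch.exp(-(p * torch.log(p + eps)).sum()).item()


for step in range(3000):
    product = factors[2] @ factors[1] @ factors[0]
    loss = ((product - target) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
print(f"effective rank of target:  {effective_rank(target):.2f}")
print(f"effective rank of product: {effective_rank(product):.2f}")
```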