Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks
- URL: http://arxiv.org/abs/2407.03848v1
- Date: Thu, 4 Jul 2024 11:29:50 GMT
- Title: Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks
- Authors: Amit Peleg, Matthias Hein
- Abstract summary: This paper aims to disentangle the factors that influence generalization by studying random and SGD-optimized networks that achieve zero training error.
We experimentally show, in the low sample regime, that overparameterization in terms of increasing width is beneficial for generalization, and that this benefit is due to the bias of SGD rather than to an architectural bias.
For increasing depth, overparameterization is detrimental for generalization, but random and SGD-optimized networks behave similarly, so this can be attributed to an architectural bias.
- Score: 37.02386277426315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural networks typically generalize well when fitting the data perfectly, even though they are heavily overparameterized. Many factors have been pointed out as the reason for this phenomenon, including an implicit bias of stochastic gradient descent (SGD) and a possible simplicity bias arising from the neural network architecture. The goal of this paper is to disentangle the factors that influence generalization stemming from optimization and architectural choices by studying random and SGD-optimized networks that achieve zero training error. We experimentally show, in the low sample regime, that overparameterization in terms of increasing width is beneficial for generalization, and this benefit is due to the bias of SGD and not due to an architectural bias. In contrast, for increasing depth, overparameterization is detrimental for generalization, but random and SGD-optimized networks behave similarly, so this can be attributed to an architectural bias. For more information, see https://bias-sgd-or-architecture.github.io .
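The comparison described in the abstract — random networks conditioned on zero training error versus SGD-optimized networks that also reach zero training error — can be illustrated with a small toy sketch. The code below is only a schematic of that protocol, not the authors' actual sampling procedure (see the project page for that); the synthetic task, network width, guess-and-check loop, and full-batch gradient descent stand-in for SGD are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary task in the low-sample regime (illustrative only).
def make_data(n):
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

X_train, y_train = make_data(8)
X_test, y_test = make_data(500)

def random_params(width):
    return [rng.normal(scale=1 / np.sqrt(2), size=(2, width)), np.zeros(width),
            rng.normal(scale=1 / np.sqrt(width), size=(width, 2)), np.zeros(2)]

def mlp(params, X):
    W1, b1, W2, b2 = params
    h = np.maximum(X @ W1 + b1, 0.0)      # one ReLU hidden layer
    return h @ W2 + b2                    # logits for two classes

def accuracy(params, X, y):
    return float(np.mean(np.argmax(mlp(params, X), axis=1) == y))

# 1) Architecture-only baseline: rejection-sample random networks until one
#    happens to reach zero training error (feasible only for tiny training sets).
rand_net = None
for _ in range(200_000):
    cand = random_params(width=64)
    if accuracy(cand, X_train, y_train) == 1.0:
        rand_net = cand
        break

# 2) Gradient-trained network (full-batch gradient descent as a stand-in for SGD),
#    run for a fixed number of steps from a fresh random initialization.
sgd_net = random_params(width=64)
for _ in range(2000):
    W1, b1, W2, b2 = sgd_net
    h = np.maximum(X_train @ W1 + b1, 0.0)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y_train)), y_train] -= 1.0          # d(cross-entropy)/d(logits)
    gW2, gb2 = h.T @ p, p.sum(axis=0)
    dh = (p @ W2.T) * (h > 0)
    gW1, gb1 = X_train.T @ dh, dh.sum(axis=0)
    for param, grad in zip(sgd_net, (gW1, gb1, gW2, gb2)):
        param -= 0.1 * grad / len(y_train)

# Once both networks fit the training set, any systematic gap in test accuracy
# can be attributed to the optimizer rather than to the architecture alone.
if rand_net is not None:
    print("random-net  test acc:", accuracy(rand_net, X_test, y_test))
print("trained-net test acc:", accuracy(sgd_net, X_test, y_test))
```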
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
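For context, the textbook (damped) Gauss-Newton step that this line of work builds on is sketched below for a least-squares objective; the paper's actual contribution — an exact, tractable form of this update in deep reversible architectures — is not reproduced here.

```latex
% Damped Gauss--Newton step for L(\theta) = \tfrac{1}{2} \lVert r(\theta) \rVert^2,
% with residual vector r and Jacobian J = \partial r / \partial \theta:
\[
  \theta_{t+1}
    = \theta_t - \bigl( J^\top J + \lambda I \bigr)^{-1} J^\top r(\theta_t),
  \qquad \lambda \ge 0,
\]
% where \lambda = 0 gives the pure Gauss--Newton update and larger \lambda
% moves the step toward plain gradient descent, since J^\top r = \nabla L.
```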
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - Automatic Gradient Descent: Deep Learning without Hyperparameters [35.350274248478804]
The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology.
The paper builds a new framework for deriving objective functions: the idea is to transform a Bregman divergence to account for the non-linear structure of the neural architecture.
arXiv Detail & Related papers (2023-04-11T12:45:52Z) - Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Improving deep neural network generalization and robustness to background bias via layer-wise relevance propagation optimization [0.0]
Features in images' backgrounds can spuriously correlate with the images' classes, representing background bias.
This can lead to deep neural networks (DNNs) that perform well on standard evaluation datasets but generalize poorly to real-world data.
We show that the optimization of LRP heatmaps can minimize the background bias influence on deep classifiers.
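The mechanism described above can be sketched as an auxiliary loss that penalizes attribution mass falling on the image background. In the sketch below, gradient-times-input saliency is used as a simple stand-in for full layer-wise relevance propagation, and the foreground mask, weighting coefficient `lambda_bg`, and function names are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def loss_with_background_penalty(model, images, labels, fg_mask, lambda_bg=1.0):
    """images: (B, C, H, W); fg_mask: (B, 1, H, W), 1 = foreground, 0 = background."""
    images = images.clone().requires_grad_(True)
    logits = model(images)
    cls_loss = F.cross_entropy(logits, labels)

    # Attribution heatmap (gradient x input), built with create_graph=True so the
    # penalty itself can be backpropagated into the classifier's weights.
    score = logits.gather(1, labels[:, None]).sum()
    grads = torch.autograd.grad(score, images, create_graph=True)[0]
    heatmap = (grads * images).abs().sum(dim=1, keepdim=True)    # (B, 1, H, W)

    # Fraction of total relevance that lands outside the foreground mask.
    bg_mass = (heatmap * (1.0 - fg_mask)).sum(dim=(1, 2, 3))
    total = heatmap.sum(dim=(1, 2, 3)) + 1e-8
    bg_loss = (bg_mass / total).mean()

    return cls_loss + lambda_bg * bg_loss
```

In a training loop, this combined loss would replace the plain cross-entropy term, steering the classifier's relevance toward foreground regions.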
arXiv Detail & Related papers (2022-02-01T05:58:01Z) - On generalization bounds for deep networks based on loss surface implicit regularization [5.68558935178946]
That modern deep neural networks generalize well despite a large number of parameters contradicts classical statistical learning theory.
arXiv Detail & Related papers (2022-01-12T16:41:34Z) - On the Implicit Biases of Architecture & Gradient Descent [46.34988166338264]
This paper finds that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin.
New technical tools suggest a nuanced portrait of generalisation involving both the implicit biases of architecture and gradient descent.
arXiv Detail & Related papers (2021-10-08T17:36:37Z) - GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the norm of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
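A minimal sketch of that heuristic is shown below: each layer gets a learnable positive scale, one optimizer step is simulated, and the scales are tuned so the post-step loss is as small as possible. This is a simplified, assumption-laden toy (plain ReLU MLP, simulated SGD inner step, Adam outer loop, no gradient-norm constraint), not the released GradInit implementation.

```python
import torch
import torch.nn.functional as F

def forward(mats, x):
    # Plain ReLU MLP whose weight matrices are passed in explicitly.
    h = x
    for W in mats[:-1]:
        h = torch.relu(h @ W.T)
    return h @ mats[-1].T

def gradinit_style_scales(weights, x, y, inner_lr=0.1, outer_lr=0.01, steps=200):
    # One learnable scalar scale per layer, initialized to 1.
    scales = [torch.ones((), requires_grad=True) for _ in weights]
    opt = torch.optim.Adam(scales, lr=outer_lr)
    for _ in range(steps):
        scaled = [s * W for s, W in zip(scales, weights)]
        loss0 = F.cross_entropy(forward(scaled, x), y)
        grads = torch.autograd.grad(loss0, scaled, create_graph=True)
        # Loss after a single simulated SGD step on the scaled weights.
        stepped = [W - inner_lr * g for W, g in zip(scaled, grads)]
        loss1 = F.cross_entropy(forward(stepped, x), y)
        opt.zero_grad(); loss1.backward(); opt.step()
        with torch.no_grad():                     # keep the scales positive
            for s in scales:
                s.clamp_(min=1e-2)
    return [s.detach() * W for s, W in zip(scales, weights)]

# Example usage on random data (shapes are illustrative):
# weights = [torch.randn(64, 32) / 32**0.5, torch.randn(10, 64) / 64**0.5]
# x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
# init_weights = gradinit_style_scales(weights, x, y)
```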
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Inductive Bias of Gradient Descent for Exponentially Weight Normalized Smooth Homogeneous Neural Nets [1.7259824817932292]
We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss.
This paper shows that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate.
arXiv Detail & Related papers (2020-10-24T14:34:56Z) - When Does Preconditioning Help or Hurt Generalization? [74.25170084614098]
We show how the implicit bias of first- and second-order methods affects the comparison of generalization properties.
We discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD.
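One common way to realize such an interpolation, shown below for concreteness, is a damped natural-gradient step; this damping parameterization is a standard form and not necessarily the exact one analyzed in the paper.

```latex
% Damped natural-gradient step with Fisher matrix F(\theta) and damping \lambda:
\[
  \theta_{t+1}
    = \theta_t - \eta \, \bigl( F(\theta_t) + \lambda I \bigr)^{-1} \nabla L(\theta_t).
\]
% \lambda \to 0 recovers natural gradient descent (NGD), while \lambda \to \infty
% recovers a rescaled gradient descent step, since
% (F + \lambda I)^{-1} \approx \lambda^{-1} I.
```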
arXiv Detail & Related papers (2020-06-18T17:57:26Z)