Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
- URL: http://arxiv.org/abs/2206.07085v1
- Date: Tue, 14 Jun 2022 18:19:05 GMT
- Title: Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
- Authors: Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora
- Abstract summary: Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets.
This paper gives mathematical analysis and experiments suggesting that normalization encourages GD to reduce the sharpness of the loss surface.
- Score: 36.83448475700536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Normalization layers (e.g., Batch Normalization, Layer Normalization) were
introduced to help with optimization difficulties in very deep nets, but they
clearly also help generalization, even in not-so-deep nets. Motivated by the
long-held belief that flatter minima lead to better generalization, this paper
gives mathematical analysis and supporting experiments suggesting that
normalization (together with accompanying weight-decay) encourages GD to reduce
the sharpness of the loss surface. Here "sharpness" is carefully defined given that
the loss is scale-invariant, a known consequence of normalization.
Specifically, for a fairly broad class of neural nets with normalization, our
theory explains how GD with a finite learning rate enters the so-called Edge of
Stability (EoS) regime, and characterizes the trajectory of GD in this regime
via a continuous sharpness-reduction flow.
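The scale-invariance the abstract refers to is easy to check numerically. Below is a minimal numpy sketch (a toy model chosen for illustration, not the paper's construction): a tiny net whose features are normalized per sample, so rescaling the weight matrix W by any c > 0 leaves the loss unchanged while shrinking the gradient by 1/c, which is exactly why sharpness must be defined with care.

```python
# Toy demonstration of scale-invariance under normalization (illustrative
# model, not the paper's construction).
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(16, 4)), rng.normal(size=16)
v = rng.normal(size=3)

def loss(W):
    h = W @ X.T                        # (3, 16) features, one column per sample
    h = h / np.linalg.norm(h, axis=0)  # normalize each sample's feature vector
    return np.mean((v @ h - y) ** 2)

def grad(W, eps=1e-6):                 # finite-difference gradient
    g = np.zeros_like(W)
    for i in np.ndindex(*W.shape):
        E = np.zeros_like(W); E[i] = eps
        g[i] = (loss(W + E) - loss(W - E)) / (2 * eps)
    return g

W = rng.normal(size=(3, 4))
for c in [1.0, 2.0, 4.0]:
    print(f"c={c}: loss={loss(c * W):.6f}  |grad|={np.linalg.norm(grad(c * W)):.6f}")
# the loss is identical for every c while |grad| scales like 1/c,
# so the Hessian (naive sharpness) scales like 1/c^2
```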
Related papers
- Implicit Bias in Deep Linear Discriminant Analysis [2.3304389517825808]
This paper presents an initial theoretical analysis of the implicit regularization induced by Deep LDA. By analyzing the gradient flow of the loss on an L-layer diagonal linear network, we prove that under balanced initialization, the network architecture transforms standard additive gradient updates into multiplicative weight updates.
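A toy numpy check of this additive-to-multiplicative effect, using depth 2 as a stand-in for the general L-layer setting (the depth and the regression data here are illustrative assumptions, not the paper's exact setup):

```python
# Depth-2 diagonal linear net: effective weight w = u * u with both layers
# sharing u (balanced). A GD step on u moves w in proportion to w itself.
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)
u = np.abs(rng.normal(size=5))        # balanced init: both layers share u

def dL_dw(w):                         # squared-loss gradient w.r.t. w
    return 2 * X.T @ (X @ w - y) / len(y)

eta = 1e-5                            # tiny step approximates gradient flow
w_old = u * u                         # effective weight of the depth-2 net
u = u - eta * 2 * u * dL_dw(u * u)    # chain rule: dL/du = 2u * dL/dw
w_new = u * u
ratio = (w_new - w_old) / (-eta * w_old * dL_dw(w_old))
print(ratio)  # ~4 in every coordinate: the effective update is proportional
              # to w itself, i.e. multiplicative rather than additive
```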
arXiv Detail & Related papers (2026-03-03T05:49:24Z)
- The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks [3.2356128177594363]
We develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs. In contrast, inserting normalization layers can provably reduce the Lipschitz constant at an exponential rate in the number of normalization operations.
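A rough numpy illustration of the capacity-control claim, using per-layer spectral normalization as a simple stand-in for the paper's normalization layers (an assumption made for brevity):

```python
# Lipschitz constant of a deep linear chain, with and without per-layer
# spectral normalization (a crude proxy for normalization layers).
import numpy as np

rng = np.random.default_rng(2)
depth, width = 20, 10
layers = [np.eye(width) + 0.3 * rng.normal(size=(width, width))
          for _ in range(depth)]

def lipschitz(mats):                  # Lipschitz constant of a linear chain
    P = np.eye(width)                 # = spectral norm of the matrix product
    for M in mats:
        P = M @ P
    return np.linalg.norm(P, 2)

print("unnormalized:", lipschitz(layers))                 # grows fast with depth
normalized = [M / np.linalg.norm(M, 2) for M in layers]   # unit spectral norm
print("normalized:  ", lipschitz(normalized))             # at most 1
```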
arXiv Detail & Related papers (2025-11-02T14:38:20Z)
- Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization [12.58055746943097]
We argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these various forms of implicit regularization. We prove for diagonal linear networks trained on a simple regression task that neither implicit bias alone minimizes the generalization error.
arXiv Detail & Related papers (2025-05-27T16:51:06Z)
- Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNets in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNets beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
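For concreteness, the Schatten 1-norm (nuclear norm) in question is simply the sum of singular values of the end-to-end matrix, as in this minimal numpy example:

```python
# Schatten 1-norm of the end-to-end matrix of a 3-layer linear net.
import numpy as np

rng = np.random.default_rng(3)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
W = W3 @ W2 @ W1                      # end-to-end matrix
schatten_1 = np.linalg.svd(W, compute_uv=False).sum()
print("Schatten 1-norm:", schatten_1)
```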
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- Robust Implicit Regularization via Weight Normalization [5.37610807422229]
We show that weight normalization enables a robust implicit bias that persists even when the weights are initialized at practically large scale.
Experiments suggest that both the speed of convergence and the robustness of the implicit bias improve dramatically when weight normalization is used.
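A minimal sketch of the weight-normalization parametrization involved, w = g * v / ||v|| (numpy, illustrative): the predictor depends on v only through its direction, which is what decouples the bias from the scale of the weights.

```python
# Weight-normalization parametrization: w = g * v / ||v||.
import numpy as np

rng = np.random.default_rng(4)
v, g = rng.normal(size=5), 2.0

def w(g, v):                           # weight-normalized parametrization
    return g * v / np.linalg.norm(v)

print(w(g, v))
print(w(g, 100.0 * v))  # identical: only the direction of v matters,
                        # so the scale of v is decoupled from the scale of w
```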
arXiv Detail & Related papers (2023-05-09T13:38:55Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
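A toy numpy illustration of the algorithmic-stability notion used here (the architecture, data, and hyperparameters are arbitrary choices for the sketch): train the same shallow net from the same initialization on two datasets that differ in a single example, then measure how far the two learned predictors drift apart.

```python
# Algorithmic stability probe: same shallow net, same GD run, two datasets
# differing in one sample; the prediction gap is what stability controls.
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 50, 3, 16                   # samples, input dim, hidden width
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.normal(size=d), rng.normal()   # replace one example

def train(X, y, steps=2000, lr=0.05):
    rs = np.random.default_rng(42)    # shared init for both runs
    W, a = 0.5 * rs.normal(size=(m, d)), 0.5 * rs.normal(size=m)
    for _ in range(steps):
        H = np.tanh(X @ W.T)          # (n, m) hidden activations
        r = H @ a - y                 # residuals
        ga = H.T @ r / len(y)
        gW = ((r[:, None] * (1 - H**2)) * a).T @ X / len(y)
        W, a = W - lr * gW, a - lr * ga
    return W, a

Xtest = rng.normal(size=(200, d))
(W1, a1), (W2, a2) = train(X, y), train(X2, y2)
gap = np.abs(np.tanh(Xtest @ W1.T) @ a1 - np.tanh(Xtest @ W2.T) @ a2)
print("mean prediction gap after swapping one sample:", gap.mean())
```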
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- On generalization bounds for deep networks based on loss surface implicit regularization [5.68558935178946]
Modern deep neural networks generalize well despite a large number of parameters, which contradicts classical statistical learning theory.
arXiv Detail & Related papers (2022-01-12T16:41:34Z)
- Vanishing Curvature and the Power of Adaptive Methods in Randomly Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in randomly initialized deep neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth).
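A quick numpy experiment in this spirit (the He-style initialization and the particular widths and depth are assumptions of the sketch, not the paper's exact scaling):

```python
# Input-gradient size through a deep random ReLU net as width varies.
import numpy as np

rng = np.random.default_rng(6)
depth = 50

def log10_grad_norm(width):
    x = rng.normal(size=width)
    J = np.eye(width)                     # Jacobian of hidden state w.r.t. input
    for _ in range(depth):
        W = np.sqrt(2.0 / width) * rng.normal(size=(width, width))
        pre = W @ x
        J = (W * (pre > 0)[:, None]) @ J  # ReLU zeroes the inactive rows of W
        x = np.maximum(pre, 0)
    u = np.ones(width) / np.sqrt(width)   # fixed readout direction
    return np.log10(np.linalg.norm(J.T @ u) + 1e-300)  # floor avoids log(0)

for width in [10, 30, 100]:
    med = np.median([log10_grad_norm(width) for _ in range(20)])
    print(f"width={width:4d}: median log10 |input grad| = {med:+.2f}")
# narrow nets (width << depth) typically show gradients that are orders of
# magnitude smaller, in line with the width-vs-depth threshold above
```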
arXiv Detail & Related papers (2021-06-07T16:29:59Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that afforded by ordinary least squares.
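A compact numpy sketch of this comparison (problem sizes, stepsize, and noise level are illustrative choices): constant-stepsize SGD with iterate averaging versus the minimum-norm least squares solution on the same overparameterized problem.

```python
# Constant-stepsize SGD with iterate averaging vs. min-norm least squares.
import numpy as np

rng = np.random.default_rng(7)
n, d = 40, 200                          # more parameters than samples
w_star = np.zeros(d); w_star[:5] = 1.0  # ground truth
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

w, avg = np.zeros(d), np.zeros(d)
lr = 0.002                              # constant stepsize throughout
for t in range(4000):
    i = rng.integers(n)                 # one-sample stochastic gradient
    w -= lr * (X[i] @ w - y[i]) * X[i]
    avg += (w - avg) / (t + 1)          # running iterate average

w_ols = np.linalg.pinv(X) @ y           # minimum-norm least squares solution

Xte = rng.normal(size=(1000, d))
for name, wh in [("SGD average", avg), ("min-norm OLS", w_ols)]:
    print(f"{name:>12}: test MSE {np.mean((Xte @ wh - Xte @ w_star) ** 2):.4f}")
```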
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Optimization Theory for ReLU Neural Networks Trained with Normalization Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the landscape and can enable faster convergence.
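One landscape consequence that is easy to verify numerically, reusing the toy normalized model from the sketch under the main abstract (again an illustration, not the paper's analysis): with normalization the loss is 0-homogeneous in the incoming weights, so the gradient is orthogonal to the weights and each GD step can only grow their norm, shrinking the effective learning rate.

```python
# For a scale-invariant loss, <grad, W> = 0 (Euler's theorem), so
# ||W||^2 increases by eta^2 ||grad||^2 at every GD step.
import numpy as np

rng = np.random.default_rng(8)
X, y, v = rng.normal(size=(16, 4)), rng.normal(size=16), rng.normal(size=3)

def loss(W):                            # same normalized toy model as before
    h = W @ X.T
    return np.mean((v @ (h / np.linalg.norm(h, axis=0)) - y) ** 2)

def grad(W, eps=1e-6):                  # finite-difference gradient
    g = np.zeros_like(W)
    for i in np.ndindex(*W.shape):
        E = np.zeros_like(W); E[i] = eps
        g[i] = (loss(W + E) - loss(W - E)) / (2 * eps)
    return g

W = rng.normal(size=(3, 4))
for _ in range(3):
    g = grad(W)
    print(f"<grad, W> = {np.sum(g * W):+.2e}   ||W|| = {np.linalg.norm(W):.4f}")
    W = W - 0.5 * g                     # since <grad, W> = 0, ||W|| only grows
```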
arXiv Detail & Related papers (2020-06-11T23:55:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.