Relative Flatness and Generalization
- URL: http://arxiv.org/abs/2001.00939v4
- Date: Thu, 4 Nov 2021 15:00:25 GMT
- Title: Relative Flatness and Generalization
- Authors: Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu,
Mario Boley
- Abstract summary: Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models.
It is still an open theoretical problem why and under which circumstances flatness is connected to generalization.
- Score: 31.307340632319583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flatness of the loss curve is conjectured to be connected to the
generalization ability of machine learning models, in particular neural
networks. While it has been empirically observed that flatness measures
consistently correlate strongly with generalization, it is still an open
theoretical problem why and under which circumstances flatness is connected to
generalization, in particular in light of reparameterizations that change
certain flatness measures but leave generalization unchanged. We investigate
the connection between flatness and generalization by relating it to the
interpolation from representative data, deriving notions of representativeness,
and feature robustness. The notions allow us to rigorously connect flatness and
generalization and to identify conditions under which the connection holds.
Moreover, they give rise to a novel, but natural relative flatness measure that
correlates strongly with generalization, simplifies to ridge regression for
ordinary least squares, and solves the reparameterization issue.
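To make the ridge-regression connection concrete, here is a minimal numerical sketch under our own simplifying assumptions (it uses a generic penalty of the form ||w||² · Tr(H), not the paper's exact relative flatness measure): for ordinary least squares the loss Hessian with respect to the weights is X^T X / n, so Tr(H) is a fixed data-dependent constant and the penalty reduces, up to scaling, to an l2 (ridge) penalty on w.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Ordinary least squares solution
w = np.linalg.lstsq(X, y, rcond=None)[0]

# Hessian of the mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w
H = X.T @ X / n

# Illustrative flatness-style penalty: ||w||^2 * trace(H).
# Because trace(H) does not depend on w, penalizing this quantity
# is equivalent (up to a constant factor) to a ridge penalty on w.
penalty = np.dot(w, w) * np.trace(H)

# Ridge regression with a regularization strength tied to trace(H)
lam = 1e-2 * np.trace(H)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

print(penalty, np.linalg.norm(w - w_ridge))
```

With a small regularization strength, the ridge solution stays close to the OLS solution, which is the sense in which the flatness penalty acts as shrinkage in this linear setting.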
Related papers
- The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization [57.37943479039033]
We study how architectural inductive bias reshapes the implicit regularization induced by the edge-of-stability phenomenon in gradient descent. We show that locality and weight sharing fundamentally change this picture.
arXiv Detail & Related papers (2026-03-05T04:50:51Z) - Random-Matrix-Induced Simplicity Bias in Over-parameterized Variational Quantum Circuits [72.0643009153473]
We show that expressive variational ansätze enter a Haar-like universality class in which both observable expectation values and parameter gradients concentrate exponentially with system size. As a consequence, the hypothesis class induced by such circuits collapses with high probability to a narrow family of near-constant functions. We further show that this collapse is not unavoidable: tensor-structured VQCs, including tensor-network-based and tensor-hypernetwork parameterizations, lie outside the Haar-like universality class.
arXiv Detail & Related papers (2026-01-05T08:04:33Z) - Generalization Below the Edge of Stability: The Role of Data Geometry [60.147710896851045]
We show how data geometry controls generalization in ReLU networks trained below the edge of stability. For data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Our results consolidate disparate empirical findings that have appeared in the literature.
arXiv Detail & Related papers (2025-10-20T21:40:36Z) - Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking [14.213441786059327]
We find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. We show theoretically that neural collapse leads to relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.
arXiv Detail & Related papers (2025-09-22T13:05:07Z) - Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z) - Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation [59.138470433237615]
We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning.
We show that systematically controlled metrics are strongly predictive of generalization performance.
This work informs an important direction towards quality-enhancing the data diversity or balance to scaling up the absolute size.
arXiv Detail & Related papers (2024-03-25T03:18:39Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We re-examine when and where double descent appears, and show that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - FAM: Relative Flatness Aware Minimization [5.132856559837775]
Optimizing for flatness was proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
arXiv Detail & Related papers (2023-07-05T14:48:24Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
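The quantity at the heart of this equivalence can be illustrated directly (a sketch of the objects involved, not of the paper's proof): for a deep linear network, the end-to-end matrix is the product of the layer matrices, and its Schatten 1-norm is the sum of its singular values, bounded above by the product of the factors' Frobenius norms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Depth-2 linear network: end-to-end matrix W = W2 @ W1
W1 = rng.normal(size=(8, 4)) / np.sqrt(8)
W2 = rng.normal(size=(4, 8)) / np.sqrt(4)
W = W2 @ W1

# Schatten 1-norm (nuclear norm): the sum of the singular values
# of the end-to-end matrix.
schatten_1 = np.linalg.svd(W, compute_uv=False).sum()

# By the variational characterization of the nuclear norm,
# ||W2 @ W1||_* <= ||W2||_F * ||W1||_F for any factorization.
frob_bound = np.linalg.norm(W2) * np.linalg.norm(W1)

print(schatten_1, frob_bound)
```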
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
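The local Lipschitz regularity these bounds depend on can be probed numerically; the following is an illustrative finite-sample estimator around a single data point (a toy stand-in for a trained network, not the bound construction from the paper).

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Toy prediction function standing in for a trained network
    return np.tanh(x @ np.array([1.5, -0.7]))

# Estimate the local Lipschitz constant of f around a point x0 by
# sampling small random perturbations and taking the largest
# observed difference quotient.
x0 = np.array([0.2, -0.1])
eps = 1e-3
ratios = []
for _ in range(500):
    d = rng.normal(size=2)
    d = eps * d / np.linalg.norm(d)
    ratios.append(abs(f(x0 + d) - f(x0)) / np.linalg.norm(d))
local_lip = max(ratios)
print(local_lip)
```

For this function the estimate approaches the local gradient norm from below, and is smaller than the global Lipschitz constant, which is the gap instance-dependent bounds exploit.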
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Scale-invariant Bayesian Neural Networks with Connectivity Tangent
Kernel [30.088226334627375]
We show that flatness and generalization bounds can be changed arbitrarily according to the scale of a parameter.
We propose new prior and posterior distributions invariant to scaling transformations by decomposing the scale and connectivity of parameters.
We empirically demonstrate our posterior provides effective flatness and calibration measures with low complexity.
arXiv Detail & Related papers (2022-09-30T03:31:13Z) - Why Flatness Correlates With Generalization For Deep Neural Networks [0.0]
We argue that local flatness measures correlate with generalization because they are local approximations to a global property.
For functions that give zero error on a test set, this volume is directly proportional to the Bayesian posterior.
Some variants of SGD can break the flatness-generalization correlation, while the volume-generalization correlation remains intact.
arXiv Detail & Related papers (2021-03-10T17:44:52Z) - Implicit Regularization in Tensor Factorization [17.424619189180675]
Implicit regularization in deep learning is perceived as a tendency of gradient-based optimization to fit training data with predictors of minimal "complexity".
We argue that tensor rank may pave the way to explaining both the implicit regularization of neural networks and the properties of real-world data translating it to generalization.
arXiv Detail & Related papers (2021-02-19T15:10:26Z) - Dimension Free Generalization Bounds for Non Linear Metric Learning [61.193693608166114]
We provide uniform generalization bounds for two regimes -- the sparse regime, and a non-sparse regime.
We show that by relying on a different, new property of the solutions, it is still possible to provide dimension free generalization guarantees.
arXiv Detail & Related papers (2021-02-07T14:47:00Z) - Implicit Regularization in ReLU Networks with the Square Loss [56.70360094597169]
We show that it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters.
Our results suggest that a more general framework may be needed to understand implicit regularization for nonlinear predictors.
arXiv Detail & Related papers (2020-12-09T16:48:03Z) - Overparameterization and generalization error: weighted trigonometric
interpolation [4.631723879329972]
We study a random Fourier series model, where the task is to estimate the unknown Fourier coefficients from equidistant samples.
We show precisely how a bias towards smooth interpolants, in the form of weighted trigonometric interpolation, can lead to smaller generalization error.
arXiv Detail & Related papers (2020-06-15T15:53:22Z)
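The setting of that last paper can be sketched in a few lines (an illustrative example with a weighting we chose ourselves, not the paper's estimator): sample a smooth signal at equidistant points, form the plain trigonometric interpolant via the DFT, and compare it against a smoothness-weighted variant that shrinks high-frequency coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64
t = np.arange(n) / n  # equidistant sample points on [0, 1)

# Ground-truth signal: a few low-frequency Fourier modes
f = np.cos(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 3 * t)
y = f + 0.1 * rng.normal(size=n)  # noisy equidistant samples

# Plain trigonometric interpolation: DFT coefficients of the samples
coeffs = np.fft.fft(y) / n

# Smoothness-biased (weighted) estimate: shrink high frequencies.
# The weight 1 / (1 + (|k|/5)^4) is an illustrative choice.
k = np.fft.fftfreq(n, d=1 / n)
weights = 1.0 / (1.0 + (np.abs(k) / 5.0) ** 4)
coeffs_weighted = coeffs * weights

# Reconstruct and measure error against the noiseless signal
recon_plain = np.real(np.fft.ifft(coeffs * n))
recon_smooth = np.real(np.fft.ifft(coeffs_weighted * n))
err_plain = np.mean((recon_plain - f) ** 2)
err_smooth = np.mean((recon_smooth - f) ** 2)
print(err_plain, err_smooth)
```

The unweighted interpolant reproduces the noisy samples exactly, so its error equals the noise level; the weighted variant trades a small bias on the true low-frequency modes for a large reduction in fitted noise, which is the bias-towards-smoothness effect the paper analyzes.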
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.