Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
- URL: http://arxiv.org/abs/2509.17738v2
- Date: Fri, 24 Oct 2025 15:43:11 GMT
- Title: Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
- Authors: Ting Han, Linara Adilova, Henning Petzka, Jens Kleesiek, Michael Kamp
- Abstract summary: We find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. We show theoretically that neural collapse leads to relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.
- Score: 14.213441786059327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet, the causal role of either phenomenon remains unclear: Are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics. We find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization, resembling grokking, even in architectures and datasets where it does not typically occur. Furthermore, we show theoretically that neural collapse leads to relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.
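The class-wise clustering that defines neural collapse is commonly quantified by an NC1-style statistic: the within-class scatter of penultimate-layer features measured against the between-class scatter, approaching zero as features collapse onto their class means. A minimal NumPy sketch of this idea (the function name `nc1_variability` and the trace-ratio convention are illustrative assumptions, not necessarily the exact metric used in the paper):

```python
import numpy as np

def nc1_variability(features, labels):
    """Within-class variability relative to between-class scatter (NC1-style).

    A value near zero indicates class-wise feature collapse.
    features: (n_samples, dim) penultimate-layer activations.
    labels:   (n_samples,) integer class labels.
    """
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    n, dim = features.shape
    sw = np.zeros((dim, dim))   # within-class scatter
    sb = np.zeros((dim, dim))   # between-class scatter
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        diff = fc - mu_c
        sw += diff.T @ diff / n
        d = (mu_c - global_mean)[:, None]
        sb += (len(fc) / n) * (d @ d.T)
    # Trace ratio Tr(Sw Sb^+) / K; pinv handles the rank-deficient Sb.
    return np.trace(sw @ np.linalg.pinv(sb)) / len(classes)

# Perfectly collapsed toy features: every sample sits at its class mean.
feats = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
labs = np.array([0] * 5 + [1] * 5)
print(nc1_variability(feats, labs))  # → 0.0: full collapse
```

Adding noise to the features moves the statistic away from zero, which is the sense in which the paper's interventions that encourage or prevent collapse can be monitored over training.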
Related papers
- Random-Matrix-Induced Simplicity Bias in Over-parameterized Variational Quantum Circuits [72.0643009153473]
We show that expressive variational ansätze enter a Haar-like universality class in which both observable expectation values and parameter gradients concentrate exponentially with system size. As a consequence, the hypothesis class induced by such circuits collapses with high probability to a narrow family of near-constant functions. We further show that this collapse is not unavoidable: tensor-structured VQCs, including tensor-network-based and tensor-hypernetwork parameterizations, lie outside the Haar-like universality class.
arXiv Detail & Related papers (2026-01-05T08:04:33Z) - Generalization Below the Edge of Stability: The Role of Data Geometry [60.147710896851045]
We show how data geometry controls generalization in ReLU networks trained below the edge of stability. For data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Our results consolidate disparate empirical findings that have appeared in the literature.
arXiv Detail & Related papers (2025-10-20T21:40:36Z) - Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization [18.72807692009739]
We show that temporal consistency extends to unseen and corrupted data, but collapses when semantic structure is destroyed. Together, these findings suggest a conceptual perspective that links feature dynamics to generalization.
arXiv Detail & Related papers (2025-09-24T17:23:56Z) - Flatness After All? [6.698677477097004]
We argue that generalization could be assessed by measuring flatness using a soft rank measure of the Hessian. For non-calibrated models, we connect our flatness measure to the well-known Takeuchi Information Criterion and show that it still provides reliable estimates of generalization gaps for models that are not overly confident.
arXiv Detail & Related papers (2025-06-21T20:33:36Z) - Deep Learning is Not So Mysterious or Different [54.5330466151362]
We argue that anomalous generalization behaviour is not unique to neural networks. We present soft inductive biases as a key unifying principle in explaining these phenomena. We also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning.
arXiv Detail & Related papers (2025-03-03T22:56:04Z) - Grokking at the Edge of Linear Separability [1.024113475677323]
Grokking is delayed generalization accompanied by non-monotonic test loss behavior. We find that grokking arises naturally even when the parameters of the problem are close to a critical point.
arXiv Detail & Related papers (2024-10-06T14:08:42Z) - When does compositional structure yield compositional generalization? A kernel theory [0.0]
We present a theory of compositional generalization in kernel models with fixed, compositionally structured representations. We identify novel failure modes in compositional generalization that arise from biases in the training data. This work examines how statistical structure in the training data can affect compositional generalization.
arXiv Detail & Related papers (2024-05-26T00:50:11Z) - Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation [59.138470433237615]
We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning.
We show that systematically controlled metrics are strongly predictive of generalization performance.
This work points toward enhancing data diversity and balance as an important complement to scaling up absolute dataset size.
arXiv Detail & Related papers (2024-03-25T03:18:39Z) - More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize [94.70343385404203]
We find that most theoretical analyses fall short of capturing qualitative phenomena even for kernel regression.
We prove that the classical GCV estimator converges to the generalization risk whenever a local random matrix law holds.
Our findings suggest that random matrix theory may be central to understanding the properties of neural representations in practice.
arXiv Detail & Related papers (2022-03-11T18:59:01Z) - Generalization by design: Shortcuts to Generalization in Deep Learning [7.751691910877239]
We show that good generalization may be instigated by bounded spectral products over layers, leading to a novel geometric regularizer.
Backed up by theory we further demonstrate that "generalization by design" is practically possible and that good generalization may be encoded into the structure of the network.
arXiv Detail & Related papers (2021-07-05T20:01:23Z) - When Is Generalizable Reinforcement Learning Tractable? [74.87383727210705]
We study the query complexity required to train RL agents that can generalize to multiple environments.
We introduce Strong Proximity, a structural condition which precisely characterizes the relative closeness of different environments.
We show that under a natural weakening of this condition, RL can require query complexity that is exponential in the horizon to generalize.
arXiv Detail & Related papers (2021-01-01T19:08:24Z) - In Search of Robust Measures of Generalization [79.75709926309703]
We develop bounds on generalization error, optimization error, and excess risk.
When evaluated empirically, most of these bounds are numerically vacuous.
We argue that generalization measures should instead be evaluated within the framework of distributional robustness.
arXiv Detail & Related papers (2020-10-22T17:54:25Z) - Relative Flatness and Generalization [31.307340632319583]
Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models.
It is still an open theoretical problem why and under which circumstances flatness is connected to generalization.
arXiv Detail & Related papers (2020-01-03T11:39:03Z)
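Several entries above appeal to spectral flatness measures of the Hessian (e.g. the soft rank measure in "Flatness After All?"). As an illustrative sketch only, one common spectral proxy, the stable (soft) rank trace/λ_max, can be computed directly from a Hessian eigenspectrum; the exact measure used in that paper may differ:

```python
import numpy as np

def soft_rank(eigenvalues, eps=1e-12):
    """Stable (soft) rank of a Hessian from its eigenvalue spectrum.

    Computed here as trace / largest eigenvalue (one common convention).
    A small value means curvature is concentrated in few directions,
    i.e. the minimum is flat along most directions.
    """
    ev = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
    return ev.sum() / (ev.max() + eps)

sharp = [10.0, 9.0, 8.0, 7.0]   # large curvature in many directions
flat = [10.0, 0.1, 0.1, 0.1]    # curvature concentrated in one direction
print(soft_rank(sharp), soft_rank(flat))  # sharp soft rank > flat soft rank
```

In practice the full Hessian is rarely materialized; its top eigenvalues are typically estimated with matrix-free methods (e.g. Lanczos on Hessian-vector products), after which a proxy like this can be evaluated cheaply.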
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.