Generalization Below the Edge of Stability: The Role of Data Geometry
- URL: http://arxiv.org/abs/2510.18120v1
- Date: Mon, 20 Oct 2025 21:40:36 GMT
- Title: Generalization Below the Edge of Stability: The Role of Data Geometry
- Authors: Tongtong Liang, Alexander Cloninger, Rahul Parhi, Yu-Xiang Wang
- Abstract summary: We show how data geometry controls generalization in ReLU networks trained below the edge of stability. For data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Our results consolidate disparate empirical findings that have appeared in the literature.
- Score: 60.147710896851045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparameterized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.
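The two data geometries contrasted in the abstract, a mixture of low-dimensional balls versus mass concentrated on the unit sphere, can be sketched numerically. The following is a minimal illustration, not the paper's construction: the ambient dimension, intrinsic dimension, number of balls, and radius below are all assumed values chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, K, n = 50, 3, 4, 200   # ambient dim, intrinsic dim, number of balls, samples

# K random ball centers on the unit sphere, each with its own k-dim subspace.
centers = rng.standard_normal((K, D))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
bases = [np.linalg.qr(rng.standard_normal((D, k)))[0] for _ in range(K)]

def sample_ball_mixture(n, radius=0.1):
    """Mixture of K balls of intrinsic dimension k embedded in R^D."""
    X = np.empty((n, D))
    for i in range(n):
        j = rng.integers(K)
        u = rng.standard_normal(k)
        # Rescale to a uniform draw from the unit k-ball.
        u *= rng.uniform() ** (1.0 / k) / max(np.linalg.norm(u), 1e-12)
        X[i] = centers[j] + radius * bases[j] @ u
    return X

def sample_sphere(n):
    """Uniform on the unit sphere in R^D: the 'easily shattered' extreme."""
    X = rng.standard_normal((n, D))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

X_ball, X_sph = sample_ball_mixture(n), sample_sphere(n)
```

Under the paper's principle, a two-layer ReLU network trained on `X_ball`-style data would be expected to generalize at a rate governed by the intrinsic dimension k rather than the ambient dimension D, while `X_sph`-style data favors memorization.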
Related papers
- The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization [57.37943479039033]
We study how architectural inductive bias reshapes the implicit regularization induced by the edge-of-stability phenomenon in gradient descent. We show that locality and weight sharing fundamentally change this picture.
arXiv Detail & Related papers (2026-03-05T04:50:51Z)
- Understanding Generalization from Embedding Dimension and Distributional Convergence [13.491874401333021]
We study generalization from a representation-centric perspective and analyze how the geometry of learned embeddings controls predictive performance for a fixed trained model. We show that population risk can be bounded by two factors: (i) the intrinsic dimension of the embedding distribution, which determines the convergence rate of the empirical embedding distribution to the population distribution in Wasserstein distance, and (ii) the sensitivity of the downstream mapping from embeddings to predictions, characterized by Lipschitz constants.
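One ingredient in bounds of this kind, the intrinsic dimension of the embedding distribution, can be approximated crudely in code. The sketch below uses a PCA variance threshold as a stand-in; the function, threshold, and synthetic data are all hypothetical illustrations and are not the estimator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def pca_intrinsic_dim(Z, threshold=0.95):
    """Crude intrinsic-dimension proxy: number of principal components
    needed to explain `threshold` of the variance of embeddings Z."""
    Zc = Z - Z.mean(axis=0)
    s = np.linalg.svd(Zc, compute_uv=False) ** 2   # PCA eigenvalue spectrum
    ratios = np.cumsum(s) / s.sum()
    return int(np.searchsorted(ratios, threshold) + 1)

# Synthetic embeddings living on a 5-dim subspace of R^64, plus small noise.
B = np.linalg.qr(rng.standard_normal((64, 5)))[0]
Z = rng.standard_normal((1000, 5)) @ B.T + 0.01 * rng.standard_normal((1000, 64))
print(pca_intrinsic_dim(Z))  # recovers the 5-dimensional structure
```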
arXiv Detail & Related papers (2026-01-30T09:32:04Z)
- Low Rank Gradients and Where to Find Them [25.107551106396958]
We consider a spiked data model in which the bulk can be anisotropic and ill-conditioned. We show that the gradient with respect to the input weights is approximately low rank. We also demonstrate that standard regularizers, such as weight decay, input noise, and Jacobian penalties, selectively modulate these components.
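The approximate low-rankness of the input-weight gradient can be observed numerically in a toy spiked model. This is a hedged sketch, not the paper's setting: the data model, network size, initialization, and labels below are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 100, 64, 500          # input dim, hidden width, samples

# Toy spiked data: an anisotropic, ill-conditioned bulk plus a strong
# rank-one spike along a random direction v; labels depend on the spike.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
scales = np.linspace(0.1, 1.0, d)
X = rng.standard_normal((n, d)) * scales + 3.0 * rng.standard_normal((n, 1)) * v
y = X @ v

# Two-layer ReLU network f(x) = a . relu(W x) at random initialization.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

# Gradient of the squared loss (1/2n) * sum_i (f(x_i) - y_i)^2 wrt W:
# row j is (1/n) * sum_i resid_i * a_j * 1[w_j . x_i > 0] * x_i.
H = X @ W.T                                    # pre-activations, shape (n, m)
resid = np.maximum(H, 0) @ a - y               # residuals, shape (n,)
G = ((H > 0) * resid[:, None] * a).T @ X / n   # gradient, shape (m, d)

s = np.linalg.svd(G, compute_uv=False)
print("top-2 singular value ratio:", s[0] / s[1])
```

The singular value spectrum of `G` is dominated by a leading direction aligned with the spike, which is one way the "approximately low rank" claim can be made concrete.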
arXiv Detail & Related papers (2025-10-01T16:20:19Z)
- DispFormer: A Pretrained Transformer Incorporating Physical Constraints for Dispersion Curve Inversion [56.64622091009756]
This study introduces DispFormer, a transformer-based neural network for $v_s$ profile inversion from Rayleigh-wave phase and group dispersion curves. DispFormer processes dispersion data independently at each period, allowing it to handle varying lengths without requiring network modifications or strict alignment between training and testing datasets.
arXiv Detail & Related papers (2025-01-08T09:08:24Z)
- Losing dimensions: Geometric memorization in generative diffusion [10.573546162574235]
We extend the theory of memorization in generative diffusion to manifold-supported data.
Our theoretical and experimental findings indicate that different subspaces are lost due to memorization effects at different critical times and dataset sizes.
Perhaps counterintuitively, we find that, under some conditions, subspaces of higher variance are lost first due to memorization effects.
arXiv Detail & Related papers (2024-10-11T11:31:20Z)
- Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation [59.138470433237615]
We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning.
We show that systematically controlled metrics are strongly predictive of generalization performance.
This work points toward improving data diversity and balance as a complement to scaling up absolute dataset size.
arXiv Detail & Related papers (2024-03-25T03:18:39Z)
- Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
- Towards Understanding and Mitigating Dimensional Collapse in Heterogeneous Federated Learning [112.69497636932955]
Federated learning aims to train models across different clients without the sharing of data for privacy considerations.
We study how data heterogeneity affects the representations of the globally aggregated models.
We propose FedDecorr, a novel method that can effectively mitigate dimensional collapse in federated learning.
arXiv Detail & Related papers (2022-10-01T09:04:17Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
- Uniform Convergence, Adversarial Spheres and a Simple Remedy [40.44709296304123]
Previous work has cast doubt on the general framework of uniform convergence and its ability to explain generalization in neural networks.
We provide an extensive theoretical investigation of the previously studied data setting through the lens of infinitely-wide models.
We prove that the Neural Tangent Kernel (NTK) also suffers from the same phenomenon and we uncover its origin.
arXiv Detail & Related papers (2021-05-07T20:23:01Z)
- Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization [15.2292571922932]
We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent.
We show that changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples.
arXiv Detail & Related papers (2020-02-25T03:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.