Overfitting has a limitation: a model-independent generalization error bound based on Rényi entropy
- URL: http://arxiv.org/abs/2506.00182v1
- Date: Fri, 30 May 2025 19:41:37 GMT
- Title: Overfitting has a limitation: a model-independent generalization error bound based on Rényi entropy
- Authors: Atsushi Suzuki
- Abstract summary: We introduce a model-independent upper bound for generalization error applicable to algorithms whose outputs are determined solely by the data's histogram. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data. We adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the Rényi entropy is indeed essential for successful learning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization error, i.e., the impact of overfitting. The generalization behavior of increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity, failing to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound for generalization error applicable to algorithms whose outputs are determined solely by the data's histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the Rényi entropy of the data-generating distribution, suggesting that a small generalization error can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data, with the degradation attributed to the consequent increase in the data distribution's Rényi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the Rényi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.
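The bound's key quantity, the Rényi entropy of the data-generating distribution, is straightforward to compute for a discrete distribution, and the noise-injection effect the abstract describes can be checked directly. The sketch below is illustrative only; the example probabilities, the order α = 2, and the uniform mixing noise model are assumptions for demonstration, not taken from the paper. It shows that mixing a distribution toward uniform raises its Rényi entropy:

```python
import numpy as np

def renyi_entropy(probs, alpha=2.0):
    """Rényi entropy of order alpha (alpha != 1), in nats."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # drop zero-probability outcomes
    return np.log(np.sum(probs ** alpha)) / (1.0 - alpha)

# A toy data-generating distribution over three outcomes.
clean = np.array([0.7, 0.2, 0.1])

# Injecting uniform random noise flattens the distribution,
# which increases its Rényi entropy (and, per the paper's bound,
# the amount of data needed for successful learning).
noise_level = 0.5
uniform = np.full_like(clean, 1.0 / len(clean))
noisy = (1 - noise_level) * clean + noise_level * uniform

h_clean = renyi_entropy(clean)
h_noisy = renyi_entropy(noisy)
assert h_noisy > h_clean  # noise injection raises the entropy
```

The uniform distribution maximizes this quantity at log(number of outcomes), so the more noise is mixed in, the closer the entropy gets to that ceiling.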
Related papers
- Robust Molecular Property Prediction via Densifying Scarce Labeled Data [51.55434084913129]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data. We demonstrate significant performance gains on challenging real-world datasets.
arXiv Detail & Related papers (2025-06-13T15:27:40Z) - Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models [51.03144354630136]
Generalization in natural data domains is progressively achieved during training before the onset of memorization. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules.
arXiv Detail & Related papers (2025-05-22T17:40:08Z) - eGAD! double descent is explained by Generalized Aliasing Decomposition [0.0]
We introduce a novel decomposition that we call the generalized aliasing decomposition (GAD) to explain the relationship between predictive performance and model complexity. GAD decomposes the predictive error into three parts: 1) model insufficiency, which dominates when the number of parameters is much smaller than the number of data points, 2) data insufficiency, which dominates when the number of parameters is much greater than the number of data points, and 3) generalized aliasing, which dominates between these two extremes.
arXiv Detail & Related papers (2024-08-15T17:49:24Z) - Parameter uncertainties for imperfect surrogate models in the low-noise regime [0.3069335774032178]
We analyze the generalization error of misspecified, near-deterministic surrogate models.
We show posterior distributions must cover every training point to avoid a divergent generalization error.
This is demonstrated on model problems before application to thousand-dimensional datasets in atomistic machine learning.
arXiv Detail & Related papers (2024-02-02T11:41:21Z) - A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - Tradeoff of generalization error in unsupervised learning [0.6554326244334868]
Unsupervised learning generally exhibits a two-component tradeoff of the generalization error (GE).
Using a more complex model reduces the model error at the cost of the data error, with the data error playing a more significant role for a smaller training dataset.
Our results also indicate that the optimal model tends to be more complex when the data to be learned are more complex.
arXiv Detail & Related papers (2023-03-10T05:50:17Z) - ER: Equivariance Regularizer for Knowledge Graph Completion [107.51609402963072]
We propose a new regularizer, namely, the Equivariance Regularizer (ER).
ER can enhance the generalization ability of the model by employing the semantic equivariance between the head and tail entities.
The experimental results indicate a clear and substantial improvement over the state-of-the-art relation prediction methods.
arXiv Detail & Related papers (2022-06-24T08:18:05Z) - Identification of Latent Variables From Graphical Model Residuals [0.0]
We present a novel method to control for the latent space when estimating a DAG by iteratively deriving proxies for the latent space from the residuals of the inferred model.
We show that any improvement of prediction of an outcome is intrinsically capped and cannot rise beyond a certain limit as compared to the confounded model.
arXiv Detail & Related papers (2021-01-07T02:28:49Z) - Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition [34.235007566913396]
We describe an interpretable, symmetric decomposition of the variance into terms associated with the labels.
We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior.
We also analyze the strikingly rich phenomenology that arises.
arXiv Detail & Related papers (2020-11-04T21:04:02Z) - Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets.
Part of the challenge of learning robust models lies in the influence of unobserved confounders.
We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Generalisation error in learning with random features and the hidden manifold model [23.71637173968353]
We study generalised linear regression and classification for a synthetically generated dataset.
We consider the high-dimensional regime and use the replica method from statistical physics.
We show how to obtain the so-called double descent behaviour for logistic regression with a peak at the threshold.
We discuss the role played by correlations in the data generated by the hidden manifold model.
arXiv Detail & Related papers (2020-02-21T14:49:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.