Data coarse graining can improve model performance
- URL: http://arxiv.org/abs/2509.14498v1
- Date: Thu, 18 Sep 2025 00:17:01 GMT
- Title: Data coarse graining can improve model performance
- Authors: Alex Nguyen, David J. Schwab, Vudtiwat Ngampruetikorn
- Abstract summary: We study a paradox using a solvable model of high-dimensional, ridge-regularized linear regression under 'data coarse graining.' Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.
- Score: 7.325551965751601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under 'data coarse graining.' Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results reveal a nonmonotonic dependence of the prediction risk on the degree of coarse graining. A 'high-pass' scheme--which filters out less relevant, lower-signal features--can help models generalize better. By contrast, a 'low-pass' scheme that integrates out more relevant, higher-signal features is purely detrimental. Crucially, using optimal regularization, we demonstrate that this nonmonotonicity is a distinct effect of data coarse graining and not an artifact of double descent. Our framework offers a clear, analytical explanation for why careful data augmentation works: it strips away less relevant degrees of freedom and isolates more predictive signals. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.
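The two coarse-graining schemes named in the abstract can be illustrated with a small numerical sketch. This is not the authors' solvable model: the Gaussian features, the 1/j coefficient profile, the noise level, the fixed (not optimal) ridge penalty, and every function name below are assumptions chosen only for illustration. The sketch ranks features by the magnitude of their true coefficient, then compares keeping the strongest k features ('high-pass') against keeping the weakest k ('low-pass').

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate, with the penalty scaled by sample size (an assumed convention)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ y)


def coarse_grained_risk(keep, X_tr, y_tr, X_te, y_te, lam):
    """Test mean squared error of ridge regression fit on the kept feature columns only."""
    w = ridge_fit(X_tr[:, keep], y_tr, lam)
    return float(np.mean((X_te[:, keep] @ w - y_te) ** 2))


def run_demo(seed=0, n_train=100, n_test=2000, p=200, noise=0.5, lam=1e-2):
    """Return {k: (high_pass_risk, low_pass_risk)} for several numbers of kept features k."""
    rng = np.random.default_rng(seed)
    # Hypothetical signal profile: coefficient magnitude decays with feature index.
    beta = 1.0 / np.arange(1, p + 1)
    X_tr = rng.standard_normal((n_train, p))
    X_te = rng.standard_normal((n_test, p))
    y_tr = X_tr @ beta + noise * rng.standard_normal(n_train)
    y_te = X_te @ beta + noise * rng.standard_normal(n_test)

    order = np.argsort(-np.abs(beta))  # feature indices, most relevant first
    results = {}
    for k in (p, p // 2, p // 4, p // 10):
        high_pass = order[:k]   # discard the weakest features, keep the k strongest
        low_pass = order[-k:]   # discard the strongest features, keep the k weakest
        results[k] = (
            coarse_grained_risk(high_pass, X_tr, y_tr, X_te, y_te, lam),
            coarse_grained_risk(low_pass, X_tr, y_tr, X_te, y_te, lam),
        )
    return results


if __name__ == "__main__":
    for k, (hp, lp) in run_demo().items():
        print(f"kept={k:4d}  high-pass risk={hp:.3f}  low-pass risk={lp:.3f}")
```

With all features kept, the two schemes coincide. As features are discarded, the two risk columns show how the outcome depends on which end of the relevance ordering is removed. Because the penalty here is fixed, this toy cannot separate the coarse-graining effect from double descent; the abstract's claim on that point relies on optimal regularization.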
Related papers
- Transformer Is Inherently a Causal Learner [27.79148022495734]
We show that a transformer trained in an autoregressive manner naturally encodes time-delayed causal structures. We prove this connection theoretically under standard identifiability conditions. This approach greatly surpasses the performance of state-of-the-art discovery algorithms.
arXiv Detail & Related papers (2026-01-09T09:10:04Z)
- Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles [16.678827833121602]
Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator.
arXiv Detail & Related papers (2025-12-02T04:36:13Z)
- Efficient Machine Unlearning via Influence Approximation [75.31015485113993]
Influence-based unlearning has emerged as a prominent approach to estimate the impact of individual training samples on model parameters without retraining. This paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). We introduce the Influence Approximation Unlearning algorithm for efficient machine unlearning from the incremental perspective.
arXiv Detail & Related papers (2025-07-31T05:34:27Z)
- A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
- Marginal Causal Flows for Validation and Inference [3.547529079746247]
Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging. We introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process. We demonstrate the above with experiments on both simulated and real-world datasets.
arXiv Detail & Related papers (2024-11-02T16:04:57Z)
- DRoP: Distributionally Robust Data Pruning [11.930434318557156]
We conduct the first systematic study of the impact of data pruning on classification bias of trained models. We propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks.
arXiv Detail & Related papers (2024-04-08T14:55:35Z)
- Hessian-Free Online Certified Unlearning [8.875278412741695]
We develop an online unlearning algorithm that achieves near-instantaneous data removal. We prove that our proposed method outperforms the state-of-the-art methods in terms of the unlearning and generalization guarantees.
arXiv Detail & Related papers (2024-04-02T07:54:18Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data. We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality [6.700873164609009]
Overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. We quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data.
arXiv Detail & Related papers (2022-08-08T00:09:12Z)
- Extension of Dynamic Mode Decomposition for dynamic systems with incomplete information based on t-model of optimal prediction [69.81996031777717]
The Dynamic Mode Decomposition has proved to be a very efficient technique to study dynamic data. The application of this approach becomes problematic if the available data is incomplete because some dimensions of smaller scale are either missing or unmeasured. We consider a first-order approximation of the Mori-Zwanzig decomposition, state the corresponding optimization problem and solve it with a gradient-based optimization method.
arXiv Detail & Related papers (2022-02-23T11:23:59Z)
- Harmless interpolation in regression and classification with structured features [21.064512161584872]
Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data. We present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space.
arXiv Detail & Related papers (2021-11-09T15:12:26Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting. We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
- Learning Causal Models Online [103.87959747047158]
Predictive models can rely on spurious correlations in the data for making predictions. One solution for achieving strong generalization is to incorporate causal structures in the models. We propose an online algorithm that continually detects and removes spurious features.
arXiv Detail & Related papers (2020-06-12T20:49:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.