High-dimensional Analysis of Synthetic Data Selection
- URL: http://arxiv.org/abs/2510.08123v1
- Date: Thu, 09 Oct 2025 12:06:31 GMT
- Title: High-dimensional Analysis of Synthetic Data Selection
- Authors: Parham Rezaei, Filip Kovacevic, Francesco Locatello, Marco Mondelli
- Abstract summary: We show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models.
- Score: 44.67519806837088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets and generative models used for augmentation.
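The abstract describes covariance matching only at a high level: choose synthetic data whose covariance agrees with that of the target data. The snippet below is a minimal sketch of one way such a selection could look, not the authors' procedure. The Frobenius-norm criterion, the random-swap local search, and the names `covariance_gap` and `select_by_covariance_matching` are all assumptions introduced for illustration, and the data are simulated.

```python
import numpy as np


def covariance_gap(samples, target_cov):
    """Frobenius norm of (sample covariance of `samples` - `target_cov`)."""
    return float(np.linalg.norm(np.cov(samples, rowvar=False) - target_cov))


def select_by_covariance_matching(synthetic, target, k, n_swaps=2000, seed=0):
    """Hypothetical helper: pick k synthetic rows whose covariance is close
    to the covariance of the target rows, via random-swap local search."""
    n = synthetic.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of synthetic rows")
    rng = np.random.default_rng(seed)
    target_cov = np.cov(target, rowvar=False)

    # Start from a uniformly random subset of size k.
    chosen = rng.choice(n, size=k, replace=False)
    in_subset = np.zeros(n, dtype=bool)
    in_subset[chosen] = True
    best = covariance_gap(synthetic[chosen], target_cov)

    for _ in range(n_swaps):
        pos = rng.integers(k)                               # slot to replace
        candidate = rng.choice(np.flatnonzero(~in_subset))  # row outside subset
        trial = chosen.copy()
        trial[pos] = candidate
        gap = covariance_gap(synthetic[trial], target_cov)
        if gap < best:                                      # keep improving swaps only
            in_subset[chosen[pos]] = False
            in_subset[candidate] = True
            chosen, best = trial, gap
    return chosen, best


# Simulated data (an assumption): the synthetic pool has a shifted mean and
# an inflated, isotropic covariance relative to the target.
data_rng = np.random.default_rng(1)
d = 5
target = data_rng.normal(size=(300, d)) * np.array([1.0, 0.5, 2.0, 1.0, 0.3])
synthetic = 3.0 + 1.5 * data_rng.normal(size=(400, d))
target_cov = np.cov(target, rowvar=False)

# Baseline: the same random subset the search starts from (seed 0).
random_idx = np.random.default_rng(0).choice(400, size=100, replace=False)
random_gap = covariance_gap(synthetic[random_idx], target_cov)

idx, matched_gap = select_by_covariance_matching(synthetic, target, k=100, n_swaps=500)
print(f"random subset gap:  {random_gap:.3f}")
print(f"matched subset gap: {matched_gap:.3f}")
```

Because the search only accepts swaps that reduce the gap, the selected subset is never worse than its random starting point under this criterion. The sample covariance is invariant to a constant shift of the data, so the mean offset in the simulated pool plays no role in the selection.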
Related papers
- Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study [27.894241484593735]
Recursively training on synthetic data has been observed to significantly degrade performance in a wide range of tasks. We theoretically analyze this phenomenon in the setting of score-based diffusion models.
arXiv Detail & Related papers (2026-02-18T16:56:36Z)
- Beyond Real Data: Synthetic Data through the Lens of Regularization [9.459299281438074]
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. We present a learning-theoretic framework to quantify the trade-off between synthetic and real data.
arXiv Detail & Related papers (2025-10-09T11:33:09Z)
- Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
- A Generalized Theory of Mixup for Structure-Preserving Synthetic Data [2.184775414778289]
We show that mixup can distort key statistical properties such as variance, potentially leading to unintended consequences in data synthesis. We propose a novel mixup method that incorporates a generalized and flexible weighting scheme, better preserving the original data's structure. Numerical experiments confirm that the new approach not only preserves the statistical characteristics of the original data but also sustains model performance across repeated synthesis.
arXiv Detail & Related papers (2025-03-03T14:28:50Z)
- Distribution-Aware Data Expansion with Diffusion Models [55.979857976023695]
We propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model.
DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
arXiv Detail & Related papers (2024-03-11T14:07:53Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Understanding Pathologies of Deep Heteroskedastic Regression [25.509884677111344]
Heteroskedastic models predict both mean and residual noise for each data point.
At one extreme, these models fit all training data perfectly, eliminating residual noise entirely.
At the other, they overfit the residual noise while predicting a constant, uninformative mean.
We observe a lack of middle ground, suggesting a phase transition dependent on model regularization strength.
arXiv Detail & Related papers (2023-06-29T06:31:27Z) - Utility Theory of Synthetic Data Generation [12.511220449652384]
This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework.<n>It considers two utility metrics: generalization and ranking of models trained on synthetic data.
arXiv Detail & Related papers (2023-05-17T07:49:16Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)