A Law of Data Reconstruction for Random Features (and Beyond)
- URL: http://arxiv.org/abs/2509.22214v1
- Date: Fri, 26 Sep 2025 11:29:23 GMT
- Title: A Law of Data Reconstruction for Random Features (and Beyond)
- Authors: Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli
- Abstract summary: Large-scale deep learning models are known to memorize parts of the training set. We show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.
- Score: 35.943641163913206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.
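To make the reconstruction idea concrete, here is a minimal PyTorch sketch (an illustrative toy, not the authors' exact optimization method): when $p \gg dn$, the only inputs whose random features lie in the span of the training features are essentially the training points themselves, so a candidate input can be driven into that span by gradient descent.

```python
# Illustrative sketch of span-based reconstruction in a random features model.
# The loss, learning rate, and dimensions below are assumptions for this toy demo.
import torch

torch.manual_seed(0)
d, n, p = 20, 5, 4000                      # p far above the d*n threshold
W = torch.randn(p, d) / d ** 0.5           # fixed random-feature weights
phi = lambda x: torch.relu(W @ x)          # random features phi(x) = sigma(Wx)

X = torch.randn(n, d)                      # training set (unknown to the attacker)
F = torch.stack([phi(x) for x in X])       # n x p matrix of training features
Q, _ = torch.linalg.qr(F.T)                # orthonormal basis of their span (p x n)

def misalignment(z):                       # scale-free distance of phi(z) from the span
    f = phi(z)
    return torch.norm(f - Q @ (Q.T @ f)) / torch.norm(f)

z = torch.randn(d, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    misalignment(z).backward()
    opt.step()

# ReLU features are positively homogeneous, so only the direction is identified;
# compare the recovered unit vector against the (normalized) training points.
z_hat = (z / z.norm()).detach()
print(min(torch.norm(z_hat - x / x.norm()) for x in X))
```

The loss is normalized by $\|\phi(z)\|$ because the trivial point $z = 0$ always has features inside the span; with the normalization, minimizers tend to align with training samples when $p \gg dn$.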
Related papers
- Storage capacity of perceptron with variable selection [10.64866985260943]
A central challenge in machine learning is to distinguish genuine structure from chance correlations in high-dimensional data. We show that a simple perceptron can perfectly classify $P = N$ random patterns by optimally selecting $M = N$ variables out of $N$ variables. This not only provides a quantitative criterion for distinguishing true structure in the data from spurious regularities, but also yields the storage capacity of associative memory models.
arXiv Detail & Related papers (2025-12-01T16:44:57Z) - Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit [66.20349460098275]
We study the gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U} \in \mathbb{R}^{r \times d}$. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error.
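For concreteness, a short numpy sketch of such a Gaussian multi-index target (the link function $g$, the index dimension $r$, and the sample size below are illustrative assumptions, not the paper's choices):

```python
# Toy Gaussian multi-index data generator: f(x) = g(Ux) with a hidden r-dim subspace.
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 50, 3, 1000                           # ambient dim, index dim, samples (illustrative)
U, _ = np.linalg.qr(rng.normal(size=(d, r)))    # hidden subspace with orthonormal columns
U = U.T                                         # shape (r, d), so that f(x) = g(U x)
g = lambda z: np.tanh(z).sum(axis=-1)           # an illustrative link function

X = rng.normal(size=(n, d))                     # Gaussian inputs
y = g(X @ U.T)                                  # multi-index targets f(x) = g(U x)
```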
arXiv Detail & Related papers (2025-11-19T04:46:47Z) - Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training [8.824077990271503]
We investigate the role of the training dynamics in the transition from generalization to memorization. We find that the memorization time $\tau_{\mathrm{mem}}$ increases linearly with the training set size $n$, while the generalization time $\tau_{\mathrm{gen}}$ remains constant. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times.
arXiv Detail & Related papers (2025-05-23T08:58:47Z) - Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality [67.67387254989018]
We study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude.
arXiv Detail & Related papers (2025-03-10T21:51:17Z) - Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
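A minimal numpy sketch of the gradient-projection idea (a simplified illustration with assumed toy inputs, not the paper's exact PGU procedure): the update for the unlearning objective is projected onto the orthogonal complement of a subspace of gradient directions computed from the retained data, so the step interferes as little as possible with retained knowledge.

```python
# Toy projected-gradient unlearning step: remove the components of the "forget"
# gradient that point along directions important for the retained data.
import numpy as np

rng = np.random.default_rng(0)
dim = 10
theta = rng.normal(size=dim)                      # current model parameters (toy)
G_retain = rng.normal(size=(3, dim))              # gradient directions from retained data (assumed given)
U, _, _ = np.linalg.svd(G_retain.T, full_matrices=False)  # orthonormal basis of that subspace

g_forget = rng.normal(size=dim)                   # gradient of the unlearning objective (toy)
g_proj = g_forget - U @ (U.T @ g_forget)          # keep only the component orthogonal to retained directions
theta = theta - 0.1 * g_proj                      # projected update, minimally disturbing retained knowledge
```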
arXiv Detail & Related papers (2023-12-07T07:17:24Z) - How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model [47.617093812158366]
We introduce the Random Hierarchy Model: a family of synthetic tasks inspired by the hierarchical structure of language and images.
We find that deep networks learn the task by developing internal representations invariant to exchanging equivalent groups.
Our results indicate how deep networks overcome the curse of dimensionality by building invariant representations.
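A hedged sketch of a hierarchy-of-production-rules generator in the spirit of this setup (the vocabulary size, branching factor, number of synonymous rules, and depth below are illustrative assumptions, not the paper's exact construction):

```python
# Toy hierarchical generative model: a class label expands level by level through
# randomly drawn production rules, with several synonymous rules per symbol.
import numpy as np

rng = np.random.default_rng(0)
v, s, m, L = 8, 2, 3, 3                     # vocabulary, branching, synonyms per symbol, depth
rules = {(lvl, a): rng.integers(0, v, size=(m, s))
         for lvl in range(L) for a in range(v)}

def sample_input(label):
    seq = [label]
    for lvl in range(L):
        # each symbol expands into one of its m synonymous s-tuples
        seq = [b for a in seq for b in rules[(lvl, a)][rng.integers(m)]]
    return np.array(seq)                    # leaf string of length s**L for this class

x = sample_input(label=3)                   # inputs of the same class differ only by synonym choices
```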
arXiv Detail & Related papers (2023-07-05T09:11:09Z) - Multi-Task Imitation Learning for Linear Dynamical Systems [50.124394757116605]
We study representation learning for efficient imitation learning over linear systems.
We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x H}{N_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}} \right)$.
arXiv Detail & Related papers (2022-12-01T00:14:35Z) - Easy Differentially Private Linear Regression [16.325734286930764]
We study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models.
We find that this algorithm obtains strong empirical performance in the data-rich setting.
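A rough numpy sketch of the select-by-Tukey-depth idea (candidate models fit on disjoint data splits, a random-projection approximation to Tukey depth, and a plain exponential mechanism; the paper's handling of sensitivity and propose-test-release is glossed over here):

```python
# Toy version: fit many non-private OLS models, score each by an approximate
# Tukey depth among the candidates, and sample one with the exponential mechanism.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
chunks = np.array_split(np.arange(1000), 50)
thetas = np.array([np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] for idx in chunks])

def approx_depth(theta, points, n_dirs=200):
    # random-direction approximation of Tukey depth: worst-case one-sided fraction
    dirs = rng.normal(size=(n_dirs, points.shape[1]))
    proj, t = points @ dirs.T, theta @ dirs.T
    return np.minimum((proj <= t).mean(0), (proj >= t).mean(0)).min()

eps = 1.0
depths = np.array([approx_depth(t, thetas) for t in thetas])
weights = np.exp(0.5 * eps * depths * len(thetas))   # exponential-mechanism weights (sensitivity glossed over)
theta_private = thetas[rng.choice(len(thetas), p=weights / weights.sum())]
print(theta_private)
```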
arXiv Detail & Related papers (2022-08-15T17:42:27Z) - Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
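A toy numpy sketch of a linear datamodel (the synthetic "model outputs" below are generated from hidden per-example influences purely for illustration, not the paper's pipeline): regress the output on a fixed target example against the 0/1 indicator of which training examples were present in each retraining run.

```python
# Fit a linear datamodel: predict a model's output from training-set inclusion masks.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_runs = 50, 200
S = rng.integers(0, 2, size=(n_runs, n_train)).astype(float)  # inclusion mask per retraining run
true_w = rng.normal(size=n_train) / n_train                   # hidden per-example influence (synthetic)
y = S @ true_w + 0.01 * rng.normal(size=n_runs)               # model output on a fixed target example

w_hat, *_ = np.linalg.lstsq(S, y, rcond=None)                 # fitted linear datamodel
print(np.corrcoef(S @ w_hat, y)[0, 1])                        # how well the datamodel predicts outputs
```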
arXiv Detail & Related papers (2022-02-01T18:15:24Z) - Reproducible, incremental representation learning with Rosetta VAE [0.0]
Variational autoencoders are among the most popular methods for distilling low-dimensional structure from high-dimensional data.
We introduce the Rosetta VAE, a method of distilling previously learned representations and retraining new models to reproduce and build on prior results.
We demonstrate that the R-VAE reconstructs data as well as the VAE and $\beta$-VAE, and outperforms both methods in recovering a target latent space in a sequential training setting.
arXiv Detail & Related papers (2022-01-13T20:45:35Z) - On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning Problems in High-dimension [7.200655637873445]
Hamiltonian Monte Carlo (HMC) sampling offers a tractable way to generate data for training RL algorithms.
We introduce a framework, called Hamiltonian $Q$-Learning, that demonstrates, both theoretically and empirically, that $Q$ values can be learned from a dataset generated by HMC samples of actions, rewards, and state transitions.
arXiv Detail & Related papers (2020-11-11T17:35:25Z)