Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles
- URL: http://arxiv.org/abs/2512.02409v1
- Date: Tue, 02 Dec 2025 04:36:13 GMT
- Title: Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles
- Authors: Yizhou Zhang, Lun Du
- Abstract summary: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator.
- Score: 16.678827833121602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that **static pruning induces a bounded operator and therefore cannot change the spectral tail exponent**; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes **time-dependent data curation**, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.
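To make the two results concrete, here is a minimal gradient-flow simulation of our own (not the paper's code): per-mode residuals decay at rates set by a power-law eigenvalue spectrum, a bounded static weight rescales every rate without touching the tail exponent, and a hypothetical oracle repeatedly boosts the slowest half of the residual modes. The exponent, the weight values, and the median rule are all illustrative assumptions.

```python
# Toy spectral-dynamics sketch: modes of a data-induced operator with a
# power-law spectrum decay under gradient flow. A bounded static weight
# rescales all eigenvalues but leaves the tail exponent alpha unchanged;
# a hypothetical residual-tracking oracle keeps boosting the slowest modes.
import numpy as np

K, alpha, T, dt = 1000, 1.5, 200, 0.1
lam = np.arange(1, K + 1, dtype=float) ** (-alpha)    # power-law spectrum

def run(weights_fn):
    err = np.ones(K)                                  # per-mode residuals
    losses = []
    for _ in range(T):
        w = weights_fn(err)
        err = err * np.exp(-w * lam * dt)             # gradient-flow decay
        losses.append(err.sum())
    return np.array(losses)

static = run(lambda err: 2.0)                         # bounded static reweighting
oracle = run(lambda err: 1.0 + 9.0 * (err > np.median(err)))  # boost slow modes

print(f"final loss  static: {static[-1]:.3f}  oracle: {oracle[-1]:.3f}")
```

With these settings the oracle run reaches a lower loss at the same horizon, while the static run only shifts the decay by a constant factor, mirroring the bounded-operator argument.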
Related papers
- Cutting Through the Noise: On-the-fly Outlier Detection for Robust Training of Machine Learning Interatomic Potentials [0.6999740786886536]
We introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples, without requiring additional reference calculations. We show that this approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead. We validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three.
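The down-weighting step could look like the sketch below; this is our hedged guess at the mechanic (median/MAD thresholding of per-sample losses), not the paper's actual scheme.

```python
# Hypothetical on-the-fly outlier down-weighting: samples whose per-sample
# loss sits far above a robust threshold get a reduced training weight.
import numpy as np

def robust_weights(per_sample_loss, k=3.0):
    """Down-weight samples whose loss exceeds median + k * MAD."""
    med = np.median(per_sample_loss)
    mad = np.median(np.abs(per_sample_loss - med)) + 1e-12
    z = (per_sample_loss - med) / mad
    return np.where(z > k, k / np.maximum(z, k), 1.0)   # soft down-weighting

losses = np.array([0.10, 0.12, 0.09, 0.11, 5.0])        # last sample is noisy
print(robust_weights(losses))                           # noisy sample gets ~0 weight
```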
arXiv Detail & Related papers (2026-02-09T16:16:22Z) - Data coarse graining can improve model performance [7.325551965751601]
We study a paradox using a solvable model of high-dimensional, ridge-regularized linear regression under 'data coarse graining'. Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.
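A minimal numerical version of this setup (our illustration, not the paper's solvable model) drops the features least correlated with the target and compares ridge test risk before and after the cut; the relevance score and cutoff are assumptions.

```python
# Coarse-graining sketch: discard low-relevance features, compare ridge risk.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 200                                       # high-dimensional regime
beta = np.zeros(d); beta[:20] = 1.0                   # only 20 relevant features
X = rng.normal(size=(n, d)); y = X @ beta + rng.normal(scale=0.5, size=n)
Xt = rng.normal(size=(1000, d)); yt = Xt @ beta       # test set

def ridge_risk(cols, lam=1.0):
    A, B = X[:, cols], Xt[:, cols]
    w = np.linalg.solve(A.T @ A + lam * np.eye(len(cols)), A.T @ y)
    return np.mean((B @ w - yt) ** 2)

relevance = np.abs(X.T @ y)                           # crude per-feature relevance
keep = np.argsort(relevance)[-40:]                    # coarse-grain: keep top 40
print(f"full: {ridge_risk(np.arange(d)):.3f}  coarse-grained: {ridge_risk(keep):.3f}")
```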
arXiv Detail & Related papers (2025-09-18T00:17:01Z) - Understanding Data Influence with Differential Approximation [63.817689230826595]
We introduce a new formulation to approximate a sample's influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators.
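The accumulation idea can be sketched in a first-order form (the paper uses second-order corrections; this simplified variant is ours): a sample's influence is built up step by step as the inner product between the test-loss gradient and the parameter change that sample contributes.

```python
# First-order sketch of step-wise influence accumulation on linear regression.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5)); y = X @ rng.normal(size=5)
x_test, y_test = rng.normal(size=5), 0.0

theta, eta = np.zeros(5), 0.01
influence = np.zeros(len(X))                          # one score per sample
for _ in range(100):                                  # full-batch SGD
    grads = -2 * (y - X @ theta)[:, None] * X         # per-sample gradients
    g_test = 2 * (x_test @ theta - y_test) * x_test   # test-loss gradient
    influence += grads @ g_test * (-eta / len(X))     # accumulate step differences
    theta -= eta * grads.mean(axis=0)                 # actual SGD step
print("sample with largest test-loss reduction:", int(np.argmin(influence)))
```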
arXiv Detail & Related papers (2025-08-20T11:59:32Z) - Lost in Retraining: Roaming the Parameter Space of Exponential Families Under Closed-Loop Learning [0.0]
We study closed-loop learning for models that belong to exponential families. We show that maximum likelihood estimation of the parameters endows the sufficient statistics with the martingale property. We also show that this outcome may be prevented if the data contains at least one data point generated from a ground-truth model.
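A small simulation of our own illustrates the martingale claim for the simplest case: fit a Gaussian mean by maximum likelihood, resample from the fitted model, refit, and repeat. The sufficient statistic (the sample mean) then performs an unbiased random walk.

```python
# Closed-loop refitting of a Gaussian mean: E[mean_{t+1} | mean_t] = mean_t,
# so the estimate stays unbiased but diffuses through parameter space.
import numpy as np

rng = np.random.default_rng(2)
n, steps, trials = 100, 30, 2000
final = []
for _ in range(trials):
    mu = 0.0                                          # MLE of the mean
    for _ in range(steps):
        mu = rng.normal(loc=mu, scale=1.0, size=n).mean()  # refit on own samples
    final.append(mu)
print(f"mean after {steps} loops:   {np.mean(final):+.4f}  (martingale => ~0)")
print(f"spread after {steps} loops: {np.std(final):.4f}  (random-walk diffusion)")
```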
arXiv Detail & Related papers (2025-06-25T17:12:22Z) - A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
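A toy illustration of the contrast (ours, not the paper's analysis): refitting a Gaussian's scale on its own samples contracts it toward collapse, while mixing in a slice of real data stabilizes it. The 10% fraction and loop length are arbitrary choices.

```python
# Self-consuming training loop: variance collapse vs. stabilization with real data.
import numpy as np

rng = np.random.default_rng(3)
real = rng.normal(size=1000)                          # ground-truth data pool

def loop(real_fraction, n=100, steps=200):
    sigma = 1.0
    for _ in range(steps):
        synth = rng.normal(scale=sigma, size=n)
        k = int(real_fraction * n)
        batch = np.concatenate([synth[k:], rng.choice(real, size=k)])
        sigma = batch.std()                           # MLE refit on the mix
    return sigma

print(f"pure self-consumption:  sigma = {loop(0.0):.3f}")   # drifts toward 0
print(f"10% real data mixed in: sigma = {loop(0.1):.3f}")   # stays near 1
```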
arXiv Detail & Related papers (2025-02-26T06:18:13Z) - DispFormer: A Pretrained Transformer Incorporating Physical Constraints for Dispersion Curve Inversion [56.64622091009756]
This study introduces DispFormer, a transformer-based neural network for $v_s$ profile inversion from Rayleigh-wave phase and group dispersion curves. DispFormer processes dispersion data independently at each period, allowing it to handle varying lengths without requiring network modifications or strict alignment between training and testing datasets.
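Our reading of the per-period design, sketched below (this is not DispFormer's actual code): each (period, velocity) pair becomes one token, so curves of any length map to variable-length token sequences without resizing. Dimensions and layer counts are placeholders.

```python
# Per-period tokenization sketch: one token per dispersion-curve sample.
import torch
import torch.nn as nn

class PerPeriodEncoder(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Linear(2, d_model)            # (period, velocity) -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, curve):                         # curve: (batch, n_periods, 2)
        return self.encoder(self.embed(curve))        # works for any n_periods

model = PerPeriodEncoder()
short = torch.randn(1, 10, 2)                         # 10-period curve
long = torch.randn(1, 40, 2)                          # 40-period curve
print(model(short).shape, model(long).shape)          # no network changes needed
```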
arXiv Detail & Related papers (2025-01-08T09:08:24Z) - Generative Expansion of Small Datasets: An Expansive Graph Approach [13.053285552524052]
We introduce an Expansive Synthesis model that generates large-scale, information-rich datasets from minimal samples.
An autoencoder with self-attention layers and optimal transport refines distributional consistency.
Results show comparable performance, demonstrating the model's potential to augment training data effectively.
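The "optimal transport refines distributional consistency" step could, for example, be realized with an entropic OT (Sinkhorn) cost between real and synthesized samples; this is our hedged guess at the mechanism, not the paper's implementation.

```python
# Entropic optimal-transport cost between two point clouds (Sinkhorn iterations).
import numpy as np

def sinkhorn_cost(x, y, eps=1.0, iters=200):
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)    # pairwise sq. distances
    K = np.exp(-C / eps)
    a = np.ones(len(x)) / len(x)                          # uniform marginals
    b = np.ones(len(y)) / len(y)
    v = np.ones(len(y))
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                       # transport plan
    return (P * C).sum()

real = np.random.default_rng(4).normal(size=(64, 2))
synth = real + 0.3                                        # shifted synthetic cloud
print(f"OT cost to real data: {sinkhorn_cost(real, synth):.4f}")
```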
arXiv Detail & Related papers (2024-06-25T02:59:02Z) - Enhancing Activity Recognition After Stroke: Generative Adversarial Networks for Kinematic Data Augmentation [0.0]
Generalizability of machine learning models for wearable monitoring in stroke rehabilitation is often constrained by the limited scale and heterogeneity of available data.
Data augmentation addresses this challenge by adding computationally derived data to real data to enrich the variability represented in the training set.
This study employs Conditional Generative Adversarial Networks (cGANs) to create synthetic kinematic data from a publicly available dataset.
By training deep learning models on both synthetic and experimental data, we enhanced task classification accuracy: models incorporating synthetic data attained an overall accuracy of 80.0%, significantly higher than the 66.1% seen in models trained solely with real data.
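A minimal conditional-generator sketch (ours; the study's cGAN architecture is not specified here): the task label is embedded and concatenated with noise so each generated kinematic window is class-conditional. All dimensions are placeholder assumptions.

```python
# Conditional generator for synthetic kinematic windows (toy dimensions).
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, n_tasks=4, noise_dim=32, seq_len=100, channels=6):
        super().__init__()
        self.label_emb = nn.Embedding(n_tasks, 16)    # task label -> embedding
        self.net = nn.Sequential(
            nn.Linear(noise_dim + 16, 256), nn.ReLU(),
            nn.Linear(256, seq_len * channels),
        )
        self.seq_len, self.channels = seq_len, channels

    def forward(self, z, task):
        h = torch.cat([z, self.label_emb(task)], dim=1)
        return self.net(h).view(-1, self.seq_len, self.channels)

gen = ConditionalGenerator()
fake = gen(torch.randn(8, 32), torch.randint(0, 4, (8,)))
print(fake.shape)                                     # (8, 100, 6) synthetic windows
```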
arXiv Detail & Related papers (2024-06-12T15:51:00Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
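The reweighting mechanic can be sketched in miniature (our stand-in: ReScore itself learns the weights through a bilevel problem over a DAG-constrained score, not the naive residual rule below): weights shift toward the samples the current model fits worst.

```python
# Residual-driven adaptive reweighting on a toy linear structural equation.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200); y = 2.0 * x + rng.normal(scale=0.1, size=200)
y[:20] += rng.normal(scale=3.0, size=20)              # heterogeneous noise block

w = np.ones(200) / 200                                # start from uniform weights
b = 0.0
for _ in range(50):
    b = np.sum(w * x * y) / np.sum(w * x * x)         # weighted least-squares fit
    resid = np.abs(y - b * x)
    w = resid / resid.sum()                           # upweight poorly fit samples
print(f"coefficient under adaptive weights: {b:.3f}")
```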
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns [77.34726150561087]
We introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the mode-driven fibers. Despite the nonconvexity of the objective function in our model, we derive the optimal solutions by integrating the alternating direction method of multipliers (ADMM).
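A compact sketch of the low-rank machinery alluded to above (ours, simplified to a matrix: truncated singular-value shrinkage inside an impute-and-refit loop, rather than the paper's tensor ADMM):

```python
# Iterative imputation with truncated singular-value shrinkage: the top-3
# singular values are kept intact, the tail is penalized, as in truncated norms.
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))   # rank-3 ground truth
mask = rng.random(M.shape) > 0.4                          # observed entries
X = np.where(mask, M, 0.0)

for _ in range(100):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[3:] *= 0.5                                          # shrink tail, keep top-3
    low_rank = (U * s) @ Vt
    X = np.where(mask, M, low_rank)                       # keep observed entries
print(f"imputation RMSE: {np.sqrt(np.mean((X - M)[~mask] ** 2)):.4f}")
```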
arXiv Detail & Related papers (2022-05-19T08:37:56Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
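The simplest imputation-free pattern is sketched below as our stand-in (IGSGD's RL-adjusted gradients are more involved): zero-fill the missing entries, append the missingness mask as extra features, and train end-to-end on that input.

```python
# Imputation-free training baseline: zero-fill + missingness-mask features.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4)); y = X @ np.array([1.0, -2.0, 0.5, 0.0])
mask = rng.random(X.shape) > 0.3                      # ~30% of entries missing
Xin = np.hstack([np.where(mask, X, 0.0), mask.astype(float)])

w = np.zeros(8)
for _ in range(2000):                                 # plain SGD on masked input
    i = rng.integers(len(X))
    grad = 2 * (Xin[i] @ w - y[i]) * Xin[i]
    w -= 0.01 * grad
print(f"train MSE without imputation: {np.mean((Xin @ w - y) ** 2):.3f}")
```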
arXiv Detail & Related papers (2021-07-05T12:44:39Z)