Related papers: Scaling laws for learning with real and surrogate data

Scaling laws for learning with real and surrogate data

URL: http://arxiv.org/abs/2402.04376v2
Date: Fri, 28 Jun 2024 15:36:50 GMT
Title: Scaling laws for learning with real and surrogate data
Authors: Ayush Jain, Andrea Montanari, Eren Sasoglu,
Abstract summary: We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM.
Score: 12.617392961074096
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data.' We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze mathematically this method under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original ones. We trace back this behavior to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.

Related papers

Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework.<n>It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z)
Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework [10.317740844867913]
We build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple acquisition functions can enable principled training decisions across training models from 20M to 1B kernels.
arXiv Detail & Related papers (2025-03-26T22:19:47Z)
Statistically Testing Training Data for Unwanted Error Patterns using Rule-Oriented Regression [0.5831737970661137]
We provide a method to test training data for flaws, to establish a trustworthy ground-truth for a subsequent training of machine learning models.<n>Our approach extends the abilities of conventional statistical testing by letting the test-condition'' be any condition to describe a pattern in the data.<n>We provide an open source implementation for demonstration and experiments.
arXiv Detail & Related papers (2025-03-24T09:52:36Z)
DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets. Our framework, textttDUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, textttDUPRE fits a emphGaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
conditional distributions is a central problem in machine learning.<n>We propose a new paradigm that integrates both paired and unpaired data.<n>We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. Data curation strategies are typically developed agnostic of the available compute for training. We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real. By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data. Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z)
Machine Learning Force Fields with Data Cost Aware Training [94.78998399180519]
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation. Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels. We propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data.
arXiv Detail & Related papers (2023-06-05T04:34:54Z)
Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
Towards a methodology for addressing missingness in datasets, with an application to demographic health datasets [0.0]
We present a methodology for tackling missing data problems using a combination of synthetic dataset generation, missing data imputation and deep learning methods. Our results show that models trained on synthetic and imputed datasets could make predictions with an accuracy of $83 %$ and $80 %$ on $a) $ an unseen real dataset and $b)$ an unseen reserved synthetic test dataset.
arXiv Detail & Related papers (2022-11-05T09:02:30Z)
Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis. We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models [0.0]
Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models. We present a sequential selection method that identifies critically important information within a dataset. We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails.
arXiv Detail & Related papers (2022-08-27T19:43:53Z)
Easy Differentially Private Linear Regression [16.325734286930764]
We study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models. We find that this algorithm obtains strong empirical performance in the data-rich setting.
arXiv Detail & Related papers (2022-08-15T17:42:27Z)
Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.