Measuring the Effect of Training Data on Deep Learning Predictions via
Randomized Experiments
- URL: http://arxiv.org/abs/2206.10013v1
- Date: Mon, 20 Jun 2022 21:27:18 GMT
- Title: Measuring the Effect of Training Data on Deep Learning Predictions via
Randomized Experiments
- Authors: Jinkun Lin, Anqi Zhang, Mathias Lecuyer, Jinyang Li, Aurojit Panda,
Siddhartha Sen
- Abstract summary: We develop a principled algorithm for estimating the contribution of training data points to a deep learning model.
Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data.
- Score: 5.625056584412003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a new, principled algorithm for estimating the contribution of
training data points to the behavior of a deep learning model, such as a
specific prediction it makes. Our algorithm estimates the AME, a quantity that
measures the expected (average) marginal effect of adding a data point to a
subset of the training data, sampled from a given distribution. When subsets
are sampled from the uniform distribution, the AME reduces to the well-known
Shapley value. Our approach is inspired by causal inference and randomized
experiments: we sample different subsets of the training data to train multiple
submodels, and evaluate each submodel's behavior. We then use a LASSO
regression to jointly estimate the AME of each data point, based on the subset
compositions. Under sparsity assumptions ($k \ll N$ datapoints have large AME),
our estimator requires only $O(k\log N)$ randomized submodel trainings,
improving upon the best prior Shapley value estimators.
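The procedure in the abstract (sample subsets, evaluate a submodel on each, then regress with LASSO) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_utility` is a hypothetical stand-in for "train a submodel and measure its behavior", the grid of inclusion probabilities `p_grid` is an arbitrary choice of sampling distribution, and the small coordinate-descent LASSO solver is written out only to keep the example dependency-free. The feature scaling (`1/(sqrt(v)*p)` for included points, `-1/(sqrt(v)*(1-p))` for excluded ones, with `v` the mean of `1/(p*(1-p))`) is chosen so that each feature has zero mean and unit variance, which makes the population regression coefficient equal to the AME divided by `sqrt(v)`.

```python
import math
import random


def sample_design(num_points, num_subsets, p_grid, rng):
    """Draw random subsets: pick p from p_grid, then include each point w.p. p."""
    v = sum(1.0 / (p * (1.0 - p)) for p in p_grid) / len(p_grid)
    scale = math.sqrt(v)
    subsets, features = [], []
    for _ in range(num_subsets):
        p = rng.choice(p_grid)
        included = [rng.random() < p for _ in range(num_points)]
        subsets.append({n for n, inc in enumerate(included) if inc})
        # Zero-mean, unit-variance encoding of membership (an assumption of this sketch).
        features.append([1.0 / (scale * p) if inc else -1.0 / (scale * (1.0 - p))
                         for inc in included])
    return subsets, features, scale


def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=50):
    """Minimize (1/2m)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    m, d = len(X), len(X[0])
    beta = [0.0] * d
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(m)) / m for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of column j with the residual, with beta[j]'s own term added back.
            rho = sum(X[i][j] * residual[i] for i in range(m)) / m + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def estimate_ame(utility, num_points, num_subsets, p_grid, lam, rng):
    subsets, X, scale = sample_design(num_points, num_subsets, p_grid, rng)
    y = [utility(s) for s in subsets]      # in practice: train and evaluate a submodel
    mean_y = sum(y) / len(y)
    y = [value - mean_y for value in y]    # center instead of fitting an intercept
    beta = lasso_coordinate_descent(X, y, lam)
    return [scale * b for b in beta]       # undo the sqrt(v) scaling


def toy_utility(subset, effects, rng):
    """Hypothetical additive utility: only the points in `effects` matter, plus noise."""
    return sum(w for n, w in effects.items() if n in subset) + rng.gauss(0.0, 0.05)


rng = random.Random(0)
effects = {3: 1.0, 17: -0.8}  # two influential points out of 30 (k << N)
ame = estimate_ame(lambda s: toy_utility(s, effects, rng), num_points=30,
                   num_subsets=500, p_grid=[0.2, 0.4, 0.6, 0.8], lam=0.02, rng=rng)
top = sorted(range(30), key=lambda n: abs(ame[n]), reverse=True)[:2]
print("largest |AME| estimates:", [(n, round(ame[n], 2)) for n in top])
```

Because the toy utility is additive, the true marginal effect of each point is the same for every subset, so the estimates for points 3 and 17 should land near 1.0 and -0.8 (slightly shrunk toward zero by the L1 penalty), while the remaining estimates stay near zero. With a real model, each call to `utility` would be a full submodel training run, which is why the number of subsets is the cost that matters.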
Related papers
- Revisiting Score Function Estimators for $k$-Subset Sampling [5.464421236280698]
We show how to efficiently compute the $k$-subset distribution's score function using a discrete Fourier transform.
The resulting estimator provides both exact samples and unbiased gradient estimates.
Experiments in feature selection show results competitive with current methods, despite weaker assumptions.
arXiv Detail & Related papers (2024-07-22T21:26:39Z)
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
- A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset.
We introduce a novel piecewise power law (PPL) that handles the small-data and large-data regimes differently.
arXiv Detail & Related papers (2023-03-02T21:48:22Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression [14.493176427999028]
We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE) with linear regression models.
We show that the T-learner fails to achieve consistency except under random assignment, while the risk of the IPW-learner converges to zero if the propensity score is known.
arXiv Detail & Related papers (2022-02-10T18:51:52Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
- Unrolling Particles: Unsupervised Learning of Sampling Distributions [102.72972137287728]
Particle filtering is used to compute good nonlinear estimates of complex systems.
We show in simulations that the resulting particle filter yields good estimates in a wide range of scenarios.
arXiv Detail & Related papers (2021-10-06T16:58:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.