Datamodels: Predicting Predictions from Training Data
- URL: http://arxiv.org/abs/2202.00622v1
- Date: Tue, 1 Feb 2022 18:15:24 GMT
- Title: Datamodels: Predicting Predictions from Training Data
- Authors: Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc,
Aleksander Madry
- Abstract summary: We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
- Score: 86.66720175866415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a conceptual framework, datamodeling, for analyzing the behavior
of a model class in terms of the training data. For any fixed "target" example
$x$, training set $S$, and learning algorithm, a datamodel is a parameterized
function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$ -- using
only information about which examples of $S$ are contained in $S'$ -- predicts
the outcome of training a model on $S'$ and evaluating on $x$. Despite the
potential complexity of the underlying process being approximated (e.g.,
end-to-end training and evaluation of deep neural networks), we show that even
simple linear datamodels can successfully predict model outputs. We then
demonstrate that datamodels give rise to a variety of applications, such as:
accurately predicting the effect of dataset counterfactuals; identifying
brittle predictions; finding semantically similar examples; quantifying
train-test leakage; and embedding data into a well-behaved and feature-rich
representation space. Data for this paper (including pre-computed datamodels as
well as raw predictions from four million trained deep neural networks) is
available at https://github.com/MadryLab/datamodels-data.
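To make the framework concrete, the sketch below estimates a linear datamodel for a single fixed target example $x$ by regressing model outputs onto training-subset indicator vectors, mirroring the subsampling-plus-sparse-regression recipe the abstract describes. The `train_and_eval` helper, subset count, and regularization strength are illustrative placeholders, not the authors' released pipeline (whose published data comes from millions of trained networks).

```python
# Minimal sketch: fit a linear datamodel for one fixed target example x.
# Hypothetical helper (not from the paper's code): train_and_eval(idx) trains
# a fresh model on the training examples indexed by idx and returns a scalar
# output on x, e.g. the correct-class margin.
import numpy as np
from sklearn.linear_model import Lasso

def fit_linear_datamodel(train_and_eval, n_train, n_subsets=1000,
                         frac=0.5, l1_penalty=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    k = int(frac * n_train)                  # |S'| = frac * |S|
    masks = np.zeros((n_subsets, n_train))   # rows are indicator vectors 1_{S'}
    outputs = np.zeros(n_subsets)
    for i in range(n_subsets):
        idx = rng.choice(n_train, size=k, replace=False)
        masks[i, idx] = 1.0
        outputs[i] = train_and_eval(idx)     # train on S', evaluate on x
    # Sparse linear regression: predict the output as theta . 1_{S'} + b.
    reg = Lasso(alpha=l1_penalty).fit(masks, outputs)
    return reg.coef_, reg.intercept_         # theta_j: weight of example j for x
```

A large positive weight theta_j flags training example j as one whose inclusion most raises the model's output on $x$, which is what drives the counterfactual-prediction and similar-example applications listed in the abstract.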
Related papers
- Aligning Model Properties via Conformal Risk Control [4.710921988115686]
Post-training alignment via human feedback shows promise, but is often limited to generative AI settings.
In traditional non-generative settings with numerical or categorical outputs, detecting misalignment through single-sample outputs remains challenging.
We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions.
arXiv Detail & Related papers (2024-06-26T22:24:46Z)
- SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space [47.65912121120524]
We propose a novel generative model, termed SPD-DDPM, to handle large-scale data.
Our model is able to estimate $p(X)$ unconditionally and flexibly, without being given $y$.
Experimental results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally.
arXiv Detail & Related papers (2023-12-13T15:08:54Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Suppose we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that a model learned this way can achieve performance comparable to that of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments [5.625056584412003]
We develop a principled algorithm for estimating the contribution of training data points to a deep learning model.
Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data; a sketch of a naive estimator appears after this list.
arXiv Detail & Related papers (2022-06-20T21:27:18Z)
- Supervised Machine Learning with Plausible Deniability [1.685485565763117]
We study the question of how well machine learning (ML) models trained on a certain data set provide privacy for the training data.
We show that one can take a set of purely random training data, and from this define a suitable "learning rule" that will produce an ML model exactly equal to a given target model $f$.
arXiv Detail & Related papers (2021-06-08T11:54:51Z)
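The AME from "Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments" above admits a simple Monte Carlo reading: draw random subsets of the training set and average the change in model output caused by adding the point of interest. The sketch below is illustrative only, reusing the same hypothetical `train_and_eval` helper as the datamodel sketch; the paper itself develops a much more sample-efficient, regression-based estimator rather than this naive paired-subset average.

```python
# Naive Monte Carlo sketch of the average marginal effect (AME) of one
# training point on a target example. Illustrative assumption: the
# hypothetical train_and_eval(idx) helper from the datamodel sketch above.
import numpy as np

def estimate_ame(train_and_eval, n_train, point, n_subsets=200, p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_subsets):
        # Draw a random subset T of the remaining training points,
        # including each one independently with probability p.
        mask = rng.random(n_train) < p
        mask[point] = False
        base = np.flatnonzero(mask)
        # Marginal effect of adding `point` to this particular subset.
        diffs.append(train_and_eval(np.append(base, point)) - train_and_eval(base))
    return float(np.mean(diffs))  # average over the sampled subsets
```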