Datamodels: Predicting Predictions from Training Data
- URL: http://arxiv.org/abs/2202.00622v1
- Date: Tue, 1 Feb 2022 18:15:24 GMT
- Title: Datamodels: Predicting Predictions from Training Data
- Authors: Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc,
Aleksander Madry
- Abstract summary: We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
- Score: 86.66720175866415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a conceptual framework, datamodeling, for analyzing the behavior
of a model class in terms of the training data. For any fixed "target" example
$x$, training set $S$, and learning algorithm, a datamodel is a parameterized
function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$ -- using
only information about which examples of $S$ are contained in $S'$ -- predicts
the outcome of training a model on $S'$ and evaluating on $x$. Despite the
potential complexity of the underlying process being approximated (e.g.,
end-to-end training and evaluation of deep neural networks), we show that even
simple linear datamodels can successfully predict model outputs. We then
demonstrate that datamodels give rise to a variety of applications, such as:
accurately predicting the effect of dataset counterfactuals; identifying
brittle predictions; finding semantically similar examples; quantifying
train-test leakage; and embedding data into a well-behaved and feature-rich
representation space. Data for this paper (including pre-computed datamodels as
well as raw predictions from four million trained deep neural networks) is
available at https://github.com/MadryLab/datamodels-data.
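To make the framework concrete, the sketch below estimates a linear datamodel for a single fixed target example $x$ by regressing model outputs onto training-subset indicator vectors, mirroring the subsampling-plus-sparse-regression recipe the abstract describes. The `train_and_eval` helper, subset count, and regularization strength are illustrative placeholders, not the authors' released pipeline (whose published data comes from millions of trained networks).

```python
# Minimal sketch: fit a linear datamodel for one fixed target example x.
# Hypothetical helper (not from the paper's code): train_and_eval(idx) trains
# a fresh model on the training examples indexed by idx and returns a scalar
# output on x, e.g. the correct-class margin.
import numpy as np
from sklearn.linear_model import Lasso

def fit_linear_datamodel(train_and_eval, n_train, n_subsets=1000,
                         frac=0.5, l1_penalty=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    k = int(frac * n_train)                  # |S'| = frac * |S|
    masks = np.zeros((n_subsets, n_train))   # rows are indicator vectors 1_{S'}
    outputs = np.zeros(n_subsets)
    for i in range(n_subsets):
        idx = rng.choice(n_train, size=k, replace=False)
        masks[i, idx] = 1.0
        outputs[i] = train_and_eval(idx)     # train on S', evaluate on x
    # Sparse linear regression: predict the output as theta . 1_{S'} + b.
    reg = Lasso(alpha=l1_penalty).fit(masks, outputs)
    return reg.coef_, reg.intercept_         # theta_j: weight of example j for x
```

A large positive weight theta_j flags training example j as one whose inclusion most raises the model's output on $x$, which is what drives the counterfactual-prediction and similar-example applications listed in the abstract.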
Related papers
- Aligning Model Properties via Conformal Risk Control [4.710921988115686]
Post-training alignment via human feedback shows promise, but is often limited to generative AI settings.
In traditional non-generative settings with numerical or categorical outputs, detecting misalignment through single-sample outputs remains challenging.
We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions.
arXiv Detail & Related papers (2024-06-26T22:24:46Z)
- SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space [47.65912121120524]
We propose a novel generative model, termed SPD-DDPM, to handle large-scale data.
Our model is able to estimate $p(X)$ unconditionally and flexibly, without being given $y$.
Experimental results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally.
arXiv Detail & Related papers (2023-12-13T15:08:54Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Suppose we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that a model learned this way can achieve performance comparable to that of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments [5.625056584412003]
We develop a principled algorithm for estimating the contribution of training data points to a deep learning model.
Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data; a sketch of a naive estimator appears after this list.
arXiv Detail & Related papers (2022-06-20T21:27:18Z)
- Supervised Machine Learning with Plausible Deniability [1.685485565763117]
We study the question of how well machine learning (ML) models trained on a certain data set provide privacy for the training data.
We show that one can take a set of purely random training data, and from this define a suitable "learning rule" that will produce an ML model exactly equal to a given target model $f$.
arXiv Detail & Related papers (2021-06-08T11:54:51Z)
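The AME from "Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments" above admits a simple Monte Carlo reading: draw random subsets of the training set and average the change in model output caused by adding the point of interest. The sketch below is illustrative only, reusing the same hypothetical `train_and_eval` helper as the datamodel sketch; the paper itself develops a much more sample-efficient, regression-based estimator rather than this naive paired-subset average.

```python
# Naive Monte Carlo sketch of the average marginal effect (AME) of one
# training point on a target example. Illustrative assumption: the
# hypothetical train_and_eval(idx) helper from the datamodel sketch above.
import numpy as np

def estimate_ame(train_and_eval, n_train, point, n_subsets=200, p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_subsets):
        # Draw a random subset T of the remaining training points,
        # including each one independently with probability p.
        mask = rng.random(n_train) < p
        mask[point] = False
        base = np.flatnonzero(mask)
        # Marginal effect of adding `point` to this particular subset.
        diffs.append(train_and_eval(np.append(base, point)) - train_and_eval(base))
    return float(np.mean(diffs))  # average over the sampled subsets
```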