Understanding Influence Functions and Datamodels via Harmonic Analysis
- URL: http://arxiv.org/abs/2210.01072v1
- Date: Mon, 3 Oct 2022 16:45:33 GMT
- Title: Understanding Influence Functions and Datamodels via Harmonic Analysis
- Authors: Nikunj Saunshi, Arushi Gupta, Mark Braverman, Sanjeev Arora
- Abstract summary: Influence functions estimate the effect of individual data points on a model's predictions on test data.
They have been used for detecting data poisoning, identifying helpful and harmful examples, measuring the influence of groups of datapoints, and more.
Recently, Ilyas et al. [2022] introduced a linear regression method, termed datamodels, to predict the effect of training points on the model's outputs on test data.
This paper seeks to provide a better theoretical understanding of such interesting empirical phenomena.
- Score: 36.86262318584668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Influence functions estimate the effect of individual data points on predictions
of the model on test data and were adapted to deep learning in Koh and Liang
[2017]. They have been used for detecting data poisoning, identifying helpful
and harmful examples, measuring the influence of groups of datapoints, etc.
Recently, Ilyas et al. [2022] introduced a linear regression method they termed
datamodels to predict the effect of training points on the model's outputs on
test data. The current
paper seeks to provide a better theoretical understanding of such interesting
empirical phenomena. The primary tool is harmonic analysis and the idea of
noise stability. Contributions include: (a) Exact characterization of the
learnt datamodel in terms of Fourier coefficients. (b) An efficient method to
estimate the residual error and quality of the optimum linear datamodel without
having to train the datamodel. (c) New insights into when influences of groups
of datapoints may or may not add up linearly.
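To make the setup concrete, here is a minimal sketch of datamodel fitting in the spirit of Ilyas et al. [2022], with a synthetic stand-in for retraining (the names and the toy output function are illustrative, not from the paper's code). Random training subsets are drawn, a scalar test output is recorded for each, and a linear model is fit from inclusion indicators to outputs; in the harmonic-analysis view, the fitted coefficients play the role of degree-1 Fourier coefficients, and the residual of the fit corresponds to the higher-order, non-additive part whose size contribution (b) estimates without training the datamodel.

```python
# Minimal datamodel sketch in the spirit of Ilyas et al. [2022].
# `train_and_evaluate` is a synthetic stand-in for actually retraining
# a model on a subset and evaluating it on a fixed test point.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_subsets = 50, 2000

# S[i, j] = 1 iff training point j is in random subset i (p = 0.5)
S = (rng.random((n_subsets, n_train)) < 0.5).astype(float)

def train_and_evaluate(mask: np.ndarray) -> float:
    # Additive ground truth plus one pairwise interaction, so the
    # best linear datamodel has nonzero residual error.
    w = np.linspace(-1.0, 1.0, n_train)
    return float(mask @ w + 0.5 * mask[0] * mask[1])

y = np.array([train_and_evaluate(m) for m in S])

# Least-squares linear datamodel: y ~ theta0 + S @ theta
X = np.hstack([np.ones((n_subsets, 1)), S])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
theta0, theta = coef[0], coef[1:]

# Residual error = the part of the subset-to-output function that no
# linear (degree-1 Fourier) datamodel can capture.
mse = float(np.mean((y - X @ coef) ** 2))
print(theta[:3], mse)
```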
Related papers
- Influence Functions for Scalable Data Attribution in Diffusion Models [52.92223039302037]
Diffusion models have led to significant advancements in generative modelling.
Yet their widespread adoption poses challenges regarding data attribution and interpretability.
In this paper, we aim to help address such challenges by developing an influence functions framework.
arXiv Detail & Related papers (2024-10-17T17:59:02Z)
- In-Context Probing Approximates Influence Function for Data Valuation [16.404477234171733]
We show that data valuation through in-context probing approximates influence functions for selecting training data.
Our empirical findings show that in-context probing and gradient-based influence frameworks are similar in how they rank training data (a toy rank-agreement check appears after this list).
arXiv Detail & Related papers (2024-07-17T02:06:56Z)
- Explainability of Machine Learning Models under Missing Data [2.880748930766428]
Missing data is a prevalent issue that can significantly impair model performance and interpretability.
This paper briefly summarizes the development of the field of missing data and investigates the effects of various imputation methods on the calculation of Shapley values (an illustrative toy example appears after this list).
arXiv Detail & Related papers (2024-06-29T11:31:09Z)
- Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models [36.05242956018461]
In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection.
We first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets.
We then demonstrate its effectiveness in detecting mislabeled samples in vision models and in selecting data samples that improve the performance of natural language processing transformer models (a simplified sketch of the gradient-outlier idea appears after this list).
arXiv Detail & Related papers (2024-05-06T21:34:46Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data that requires calculating gradients only for specific test samples, paired with a forward pass for each training point (a finite-difference reading of this recipe is sketched after this list).
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, although this method successfully reduces lexical biases in the training data, we still find strong evidence of the corresponding bias in the trained models (a toy reweighting construction appears after this list).
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Measuring Causal Effects of Data Statistics on Language Model's "Factual" Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language for describing how training data influences predictions, through a causal framework.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
arXiv Detail & Related papers (2022-07-28T17:36:24Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
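For the In-Context Probing entry above: the paper's claim concerns rank agreement between two data-valuation methods. A minimal, hedged illustration with synthetic scores (not the paper's data or models) compares two score vectors with Spearman rank correlation:

```python
# Synthetic rank-agreement check: two scoring methods for the same
# training points, compared by Spearman rank correlation.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
gradient_influence = rng.normal(size=100)                         # stand-in for influence scores
probing_scores = gradient_influence + 0.2 * rng.normal(size=100)  # stand-in for probing scores

rho, pvalue = spearmanr(gradient_influence, probing_scores)
print(f"Spearman rank correlation: {rho:.2f} (p = {pvalue:.1e})")
```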
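For the Explainability under Missing Data entry: a tiny illustration (not the paper's setup) of why the imputation rule matters. Exact Shapley attributions on a two-feature toy model, with off-coalition features filled by mean versus median imputation, give different answers:

```python
# Toy example: Shapley-style attributions where features outside the
# coalition are filled by an imputation rule; swapping the rule
# (mean vs. median) changes the attributions.
import itertools
import math
import numpy as np

def model(x):
    return x[0] * x[1] + x[1]  # toy nonlinear model on 2 features

background = np.array([[0.0, 0.0], [1.0, 1.0], [8.0, 10.0]])

def value(coalition, x, impute):
    z = impute(background).astype(float)  # impute "missing" features
    idx = list(coalition)
    z[idx] = x[idx]                       # reveal features in the coalition
    return model(z)

def shapley(x, impute, d=2):
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(d):
            for S in itertools.combinations(others, r):
                weight = (math.factorial(r) * math.factorial(d - r - 1)
                          / math.factorial(d))
                phi[j] += weight * (value(S + (j,), x, impute) - value(S, x, impute))
    return phi

x = np.array([3.0, 5.0])
print(shapley(x, lambda B: B.mean(axis=0)))        # mean imputation
print(shapley(x, lambda B: np.median(B, axis=0)))  # median imputation
```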
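For the Outlier Gradient Analysis entry: a simplified sketch of the underlying idea (assumed PyTorch interfaces, not the authors' implementation): treat per-sample loss gradients as points and flag samples whose gradients lie far from the mean gradient:

```python
# Hedged sketch: per-sample gradients as points; samples whose gradients
# deviate most from the mean are flagged as candidate detrimental examples.
# `model` and `loss_fn` are assumed PyTorch objects, not from the paper.
import torch

def per_sample_gradients(model, loss_fn, xs, ys):
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g.detach().clone())
    return torch.stack(grads)

def outlier_scores(grads):
    # Distance from the mean gradient; a large score means the sample
    # pulls parameters away from the consensus update direction.
    mean_g = grads.mean(dim=0)
    return torch.linalg.vector_norm(grads - mean_g, dim=1)
```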
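For the Mirrored Influence Hypothesis entry: read literally, the recipe admits a finite-difference sketch (every detail here is an assumption, not the paper's algorithm): back-propagate once on the test sample, nudge the parameters along that gradient, and score each training point by the change in its loss using forward passes only:

```python
# Finite-difference sketch of "test gradient once + forward passes per
# training point". `model`, `loss_fn`, and tensors are assumed; this is
# an illustration of the described recipe, not the paper's method.
import copy
import torch

def mirrored_influence_scores(model, loss_fn, test_x, test_y, train_pairs, eps=1e-3):
    # One backward pass on the test sample
    model.zero_grad()
    loss_fn(model(test_x.unsqueeze(0)), test_y.unsqueeze(0)).backward()

    # Take a small gradient-descent step on the test loss in a copy
    probe = copy.deepcopy(model)
    with torch.no_grad():
        for p, q in zip(model.parameters(), probe.parameters()):
            if p.grad is not None:
                q -= eps * p.grad

    # Forward passes only: score each training point by its loss change.
    # Negative score = the point's loss falls as the model improves on the
    # test sample (aligned, helpful); positive = harmful.
    scores = []
    with torch.no_grad():
        for x, y in train_pairs:
            base = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
            moved = loss_fn(probe(x.unsqueeze(0)), y.unsqueeze(0))
            scores.append(((moved - base) / eps).item())
    return scores
```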
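For the Stubborn Lexical Bias entry: one simple instance of reweighting away a spurious correlation (a toy closed-form construction; the paper's optimizer handles thousands of correlations jointly): choose example weights that make a binary lexical feature independent of the label under the weighted distribution:

```python
# Toy construction: weight examples so a binary lexical feature becomes
# independent of the label under the weighted distribution. The paper's
# method handles thousands of correlations jointly; this handles one.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
labels = rng.integers(0, 2, n)
# Lexical feature spuriously correlated with the label
feature = (rng.random(n) < 0.3 + 0.4 * labels).astype(int)

# w_i proportional to p(f_i) p(y_i) / p(f_i, y_i) factorizes the weighted joint
p_f = np.bincount(feature, minlength=2) / n
p_y = np.bincount(labels, minlength=2) / n
p_fy = np.histogram2d(feature, labels, bins=2)[0] / n
w = p_f[feature] * p_y[labels] / p_fy[feature, labels]
w /= w.sum()

# Weighted covariance between feature and label is (numerically) zero
cov = np.sum(w * (feature - w @ feature) * (labels - w @ labels))
print(f"raw covariance:      {np.cov(feature, labels)[0, 1]:.4f}")
print(f"weighted covariance: {cov:.2e}")
```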