Data Valuation Without Training of a Model
- URL: http://arxiv.org/abs/2301.00930v1
- Date: Tue, 3 Jan 2023 02:19:20 GMT
- Title: Data Valuation Without Training of a Model
- Authors: Nohyun Ki, Hoyong Choi and Hye Won Chung
- Abstract summary: We propose a training-free data valuation score, called the complexity-gap score, to quantify the influence of individual instances on the generalization of neural networks.
The proposed score can quantify the irregularity of the instances and measure how much each data instance contributes to the total movement of the network parameters during training.
- Score: 8.89493507314525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many recent works on understanding deep learning try to quantify how much
individual data instances influence the optimization and generalization of a
model, either by analyzing the behavior of the model during training or by
measuring the performance gap of the model when the instance is removed from
the dataset. Such approaches reveal the characteristics and importance of
individual instances, which may provide useful information for diagnosing and
improving deep learning. However, most existing works on data valuation
require actual training of a model, which often demands a high computational
cost. In this paper, we provide a training-free data valuation score, called
the complexity-gap score, a data-centric score that quantifies the influence
of individual instances on the generalization of two-layer overparameterized
neural networks. The proposed score can quantify the irregularity of the
instances and measure how much each data instance contributes to the total
movement of the network parameters during training. We theoretically analyze
and empirically demonstrate the effectiveness of the complexity-gap score in
finding 'irregular or mislabeled' data instances, and also provide applications
of the score in analyzing datasets and diagnosing training dynamics.
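The abstract does not spell out how the complexity-gap score is computed. As a rough, non-authoritative illustration of the general idea of a training-free, data-centric valuation, the sketch below scores each instance by how much its removal changes a kernel complexity measure of the form y^T (K + reg*I)^{-1} y; the RBF kernel, the regularizer, and the leave-one-out loop are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise RBF (Gaussian) kernel matrix as a stand-in for a
    # data-dependent Gram matrix on the training inputs.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def complexity(X, y, reg=1e-3):
    # A kernel complexity measure of the form y^T (K + reg*I)^{-1} y, loosely
    # inspired by generalization bounds for overparameterized two-layer
    # networks (assumption: the paper's measure is only related, not identical).
    K = rbf_kernel(X) + reg * np.eye(len(X))
    return float(y @ np.linalg.solve(K, y))

def complexity_gap_scores(X, y):
    # Leave-one-out style score: how much does removing instance i change the
    # complexity of the data?  Larger magnitudes flag "irregular" instances.
    base = complexity(X, y)
    scores = np.empty(len(X))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        scores[i] = base - complexity(X[mask], y[mask])
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = np.sign(X[:, 0])
    y[:3] *= -1          # flip a few labels to mimic mislabeled data
    s = complexity_gap_scores(X, y)
    print("most irregular instances:", np.argsort(-np.abs(s))[:5])
```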
Related papers
- Early-Stage Anomaly Detection: A Study of Model Performance on Complete vs. Partial Flows [0.0]
This study investigates the efficacy of machine learning models, specifically Random Forest, in anomaly detection systems.
We explore the performance disparity that arises when models are applied to incomplete data typical of real-world, real-time network environments.
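The summary names Random Forest and a complete-vs-partial-flow comparison; the snippet below is a generic scikit-learn sketch of that comparison on synthetic feature vectors, where zeroing the second half of the columns is a placeholder for statistics that are unavailable in a partial flow (not the study's actual data or feature set).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder "flow" features: the last half of the columns stand in for
# statistics that only become available once a flow has completed.
X = rng.normal(size=(2000, 20))
y = (X[:, :10].sum(axis=1) + 0.5 * X[:, 10:].sum(axis=1) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train on complete flows, then evaluate on complete vs. "partial" flows,
# where the late-flow features are zeroed out to mimic early-stage detection.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
X_te_partial = X_te.copy()
X_te_partial[:, 10:] = 0.0

print("accuracy on complete flows:", clf.score(X_te, y_te))
print("accuracy on partial flows :", clf.score(X_te_partial, y_te))
```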
arXiv Detail & Related papers (2024-07-03T07:14:25Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method evaluates model behavior comparably to the direct retraining method while significantly speeding up the process.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
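The abstract only states that gradients for a few test samples are paired with a forward pass per training point. A minimal sketch of that general recipe, under the assumption of a last-layer (here, linear squared-loss) approximation in which each training point's gradient factorizes into its residual times its features, so only a forward pass is needed per training point; this is an illustrative assumption, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression model standing in for a network's last layer.
n_train, n_test, d = 200, 5, 10
X_tr = rng.normal(size=(n_train, d))
w_true = rng.normal(size=d)
y_tr = X_tr @ w_true + 0.1 * rng.normal(size=n_train)
y_tr[:5] += 3.0                      # a few corrupted training labels
X_te = rng.normal(size=(n_test, d))
y_te = X_te @ w_true

# "Trained" parameters (here a least-squares fit).
w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

# Full loss gradients for the few test points of interest:
# for squared loss, grad_w = (w.x - y) * x.
g_test = (X_te @ w - y_te)[:, None] * X_te            # (n_test, d)

# For each training point only a forward pass is needed, because its
# gradient factorizes into a residual times its features.
resid_tr = X_tr @ w - y_tr                             # forward pass
g_train = resid_tr[:, None] * X_tr                     # (n_train, d)

# Gradient-alignment influence of every training point on every test point.
influence = g_train @ g_test.T                         # (n_train, n_test)
print("training points most aligned with test point 0:",
      np.argsort(-influence[:, 0])[:5])
```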
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
- Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
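A minimal sketch of the amortization idea as described in the summary: cheap, noisy per-example estimates of an expensive score are regressed onto example features, so a single forward pass can replace repeated exact computations. The synthetic scores and the MLPRegressor choice are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Suppose each example has a "true" valuation score that is expensive to
# compute exactly, but cheap to estimate with an unbiased, noisy one-sample
# Monte Carlo estimate (all synthetic here).
X = rng.normal(size=(5000, 16))
true_scores = np.tanh(X[:, 0]) + 0.3 * X[:, 1]
noisy_labels = true_scores + rng.normal(scale=1.0, size=len(X))

# Amortization: regress the cheap noisy estimates onto example features, so a
# single forward pass replaces many expensive valuation computations.
amortizer = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300,
                         random_state=0).fit(X, noisy_labels)

pred = amortizer.predict(X)
corr = np.corrcoef(pred, true_scores)[0, 1]
print(f"correlation between amortized predictions and true scores: {corr:.3f}")
```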
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
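The paper frames this as an optimization that reweights training data against thousands of spurious correlations; the sketch below shows only the simplest one-feature special case, where inverse-frequency weighting of the (label, feature) groups removes the weighted correlation. The synthetic data and the group-balancing shortcut are assumptions for illustration, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary labels and a spurious binary feature (e.g., a lexical cue) that
# co-occurs with the positive label far more often than chance.
n = 10_000
y = rng.integers(0, 2, size=n)
spurious = np.where(rng.random(n) < 0.8, y, 1 - y)

def weighted_corr(w, a, b):
    # Pearson correlation under example weights w.
    w = w / w.sum()
    am, bm = np.sum(w * a), np.sum(w * b)
    cov = np.sum(w * (a - am) * (b - bm))
    return cov / np.sqrt(np.sum(w * (a - am) ** 2) * np.sum(w * (b - bm) ** 2))

# Reweight each example by the inverse frequency of its (label, feature) group,
# which equalizes the four groups and removes the spurious correlation.
group = 2 * y + spurious
counts = np.bincount(group, minlength=4)
weights = 1.0 / counts[group]

print("correlation before reweighting:", weighted_corr(np.ones(n), spurious, y))
print("correlation after  reweighting:", weighted_corr(weights, spurious, y))
```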
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning over structured spaces with a simple use of classical results from causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
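A minimal random-oversampling sketch in plain NumPy, duplicating minority-class examples until the classes are balanced; libraries such as imbalanced-learn provide this and the corresponding undersampling strategies, and the toy data here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy dataset: 950 majority-class and 50 minority-class examples.
X = rng.normal(size=(1000, 8))
y = np.array([0] * 950 + [1] * 50)

def random_oversample(X, y, rng):
    # Duplicate minority-class examples (sampling with replacement) until
    # every class has as many instances as the largest class.
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, cnt in zip(classes, counts):
        cls_idx = np.flatnonzero(y == c)
        extra = rng.choice(cls_idx, size=target - cnt, replace=True)
        idx.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X_bal, y_bal = random_oversample(X, y, rng)
print("class counts before:", np.bincount(y))
print("class counts after :", np.bincount(y_bal))
```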
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- More data or more parameters? Investigating the effect of data structure on generalization [17.249712222764085]
Properties of the data affect the test error as a function of the number of training examples and the number of model parameters.
We show that noise in the labels and strong anisotropy of the input data have similar effects on the test error.
arXiv Detail & Related papers (2021-03-09T16:08:41Z)
- Learning from Incomplete Features by Simultaneous Training of Neural Networks and Sparse Coding [24.3769047873156]
This paper addresses the problem of training a classifier on a dataset with incomplete features.
We assume that different subsets of features (random or structured) are available at each data instance.
A new supervised learning method is developed to train a general classifier, using only a subset of features per sample.
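The paper's method jointly trains the network with a sparse coding model; the snippet below is not that method, just a simple baseline that makes the incomplete-feature setting concrete: each sample observes a random subset of features, and a classifier is trained on zero-imputed values concatenated with the observation mask (data and names are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Each sample observes only a random subset of its features.
n, d = 2000, 20
X_full = rng.normal(size=(n, d))
y = (X_full[:, :5].sum(axis=1) > 0).astype(int)
mask = rng.random((n, d)) < 0.6            # True where a feature is observed
X_obs = np.where(mask, X_full, 0.0)        # zero-impute the missing entries

# Simple baseline for the incomplete-feature setting: feed the classifier the
# imputed features together with the observation mask, so it can learn to
# discount entries that were never observed.
X_in = np.concatenate([X_obs, mask.astype(float)], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_in, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test accuracy with per-sample feature subsets:", clf.score(X_te, y_te))
```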
arXiv Detail & Related papers (2020-11-28T02:20:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.