LAVA: Data Valuation without Pre-Specified Learning Algorithms
- URL: http://arxiv.org/abs/2305.00054v3
- Date: Tue, 19 Dec 2023 20:14:51 GMT
- Title: LAVA: Data Valuation without Pre-Specified Learning Algorithms
- Authors: Hoang Anh Just, Feiyang Kang, Jiachen T. Wang, Yi Zeng, Myeongseob Ko,
Ming Jin, Ruoxi Jia
- Abstract summary: We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
- Score: 20.578106028270607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditionally, data valuation (DV) is posed as a problem of equitably
splitting the validation performance of a learning algorithm among the training
data. As a result, the calculated data values depend on many design choices of
the underlying learning algorithm. However, this dependence is undesirable for
many DV use cases, such as setting priorities over different data sources in a
data acquisition process and informing pricing mechanisms in a data
marketplace. In these scenarios, data needs to be valued before the actual
analysis and the choice of the learning algorithm is still undetermined then.
Another side-effect of the dependence is that to assess the value of individual
points, one needs to re-run the learning algorithm with and without a point,
which incurs a large computation burden. This work leapfrogs over the current
limits of data valuation methods by introducing a new framework that can value
training data in a way that is oblivious to the downstream learning algorithm.
Our main results are as follows. (1) We develop a proxy for the validation
performance associated with a training set based on a non-conventional
class-wise Wasserstein distance between training and validation sets. We show
that the distance characterizes the upper bound of the validation performance
for any given model under certain Lipschitz conditions. (2) We develop a novel
method to value individual data based on the sensitivity analysis of the
class-wise Wasserstein distance. Importantly, these values can be directly
obtained for free from the output of off-the-shelf optimization solvers when
computing the distance. (3) We evaluate our new data valuation framework over
various use cases related to detecting low-quality data and show that,
surprisingly, the learning-agnostic feature of our framework enables a
significant improvement over SOTA performance while being orders of magnitude
faster.
Related papers
- Fine-tuning can Help Detect Pretraining Data from Large Language Models [7.7209640786782385]
Current methods differentiate members and non-members by designing scoring functions, like Perplexity and Min-k%.
We introduce a novel and effective method termed Fine-tuned Score Deviation (FSD), which improves the performance of current scoring functions for pretraining data detection.
arXiv Detail & Related papers (2024-10-09T15:36:42Z) - Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named the neural dynamic data valuation (NDDV)
Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z) - OpenDataVal: a Unified Benchmark for Data Valuation [38.15852021170501]
We introduce OpenDataVal, an easy-to-use and unified benchmark framework for data valuation.
OpenDataVal provides an integrated environment that includes eleven different state-of-the-art data valuation algorithms.
We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches.
arXiv Detail & Related papers (2023-06-18T14:38:29Z) - Building Manufacturing Deep Learning Models with Minimal and Imbalanced
Training Data Using Domain Adaptation and Data Augmentation [15.333573151694576]
We propose a novel domain adaptation (DA) approach to address the problem of labeled training data scarcity for a target learning task.
Our approach works for scenarios where the source dataset and the dataset available for the target learning task have same or different feature spaces.
We evaluate our combined approach using image data for wafer defect prediction.
arXiv Detail & Related papers (2023-05-31T21:45:34Z) - A Pretrainer's Guide to Training Data: Measuring the Effects of Data
Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z) - CheckSel: Efficient and Accurate Data-valuation Through Online
Checkpoint Selection [3.321404824316694]
We propose a novel 2-phase solution to the problem of data valuation and subset selection.
Phase 1 selects representative checkpoints from an SGD-like training algorithm, which are used in phase-2 to estimate the approximate training data values.
Experimental results show the proposed algorithm outperforms recent baseline methods by up to 30% in terms of test accuracy.
arXiv Detail & Related papers (2022-03-14T02:06:52Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z) - Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based ALs are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z) - Provably Efficient Causal Reinforcement Learning with Confounded
Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $rho$-gap.
We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.