Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value
- URL: http://arxiv.org/abs/2304.07718v3
- Date: Thu, 1 Jun 2023 04:03:02 GMT
- Title: Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value
- Authors: Yongchan Kwon, James Zou
- Abstract summary: We propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate.
Data-OOB takes less than 2.25 hours on a single CPU processor when there are $106$ samples to evaluate and the input dimension is 100.
We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points.
- Score: 17.340091573913316
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data valuation is a powerful framework for providing statistical insights
into which data are beneficial or detrimental to model training. Many
Shapley-based data valuation methods have shown promising results in various
downstream tasks, however, they are well known to be computationally
challenging as it requires training a large number of models. As a result, it
has been recognized as infeasible to apply to large datasets. To address this
issue, we propose Data-OOB, a new data valuation method for a bagging model
that utilizes the out-of-bag estimate. The proposed method is computationally
efficient and can scale to millions of data by reusing trained weak learners.
Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor
when there are $10^6$ samples to evaluate and the input dimension is 100.
Furthermore, Data-OOB has solid theoretical interpretations in that it
identifies the same important data point as the infinitesimal jackknife
influence function when two different points are compared. We conduct
comprehensive experiments using 12 classification datasets, each with thousands
of sample sizes. We demonstrate that the proposed method significantly
outperforms existing state-of-the-art data valuation methods in identifying
mislabeled data and finding a set of helpful (or harmful) data points,
highlighting the potential for applying data values in real-world applications.
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z) - Is Data Valuation Learnable and Interpretable? [3.9325957466009203]
Current data valuation methods ignore the interpretability of the output values.
This study aims to answer an important question: is data valuation learnable and interpretable?
arXiv Detail & Related papers (2024-06-03T08:13:47Z) - Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z) - Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named the neural dynamic data valuation (NDDV)
Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z) - LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
arXiv Detail & Related papers (2023-04-28T19:05:16Z) - Data Budgeting for Machine Learning [17.524791147624086]
We study the data budgeting problem and formulate it as two sub-problems.
We propose a learning method to solve data budgeting problems.
Our empirical evaluation shows that it is possible to perform data budgeting given a small pilot study dataset with as few as $50$ data points.
arXiv Detail & Related papers (2022-10-03T14:53:17Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z) - Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.