Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits
- URL: http://arxiv.org/abs/2405.03875v1
- Date: Mon, 6 May 2024 21:46:10 GMT
- Title: Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits
- Authors: Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia
- Abstract summary: Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research.
Data selection is considered a standard application of Data Shapley, but its data selection performance has been shown to be inconsistent across settings in the literature.
We introduce a hypothesis testing framework and show that Data Shapley's performance can be no better than random selection without specific constraints on utility functions.
- Score: 37.79841753524388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley's performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley's effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.
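The abstract's key positive result, that Data Shapley selects data optimally when the utility is a monotone transformation of a modular (additive) function, can be illustrated with a small sketch. The per-point qualities, the sqrt transform, and the brute-force permutation enumeration below are all illustrative assumptions, not the paper's construction or implementation.

```python
import itertools
import math

def shapley_values(points, utility):
    """Exact Shapley values by enumerating all permutations.

    Feasible only for tiny datasets; practical Data Shapley relies on
    Monte Carlo or closed-form approximations instead.
    """
    n = len(points)
    values = [0.0] * n
    perms = list(itertools.permutations(range(n)))
    for perm in perms:
        subset = []
        prev_u = utility(subset)
        for i in perm:
            subset = subset + [points[i]]
            u = utility(subset)
            # Average marginal contribution over all orderings.
            values[i] += (u - prev_u) / len(perms)
            prev_u = u
    return values

# Hypothetical per-point qualities; the utility is a monotone transform
# (sqrt) of the modular function sum(qualities in subset).
quality = [0.1, 0.9, 0.4, 0.7]
utility = lambda subset: math.sqrt(sum(subset))

vals = shapley_values(quality, utility)

# Under a monotonically transformed modular utility, ranking by Shapley
# value recovers the ranking by underlying per-point quality, so
# selecting the top-k Shapley points selects the best subset.
rank_by_shapley = sorted(range(len(quality)), key=lambda i: -vals[i])
rank_by_quality = sorted(range(len(quality)), key=lambda i: -quality[i])
assert rank_by_shapley == rank_by_quality
```

The intuition: for any fixed preceding subset with total quality t, the marginal contribution of point i is f(t + v_i) - f(t), which is increasing in v_i whenever f is increasing, so the averaging over permutations preserves the quality ordering.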
Related papers
- Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making [5.755427480127593]
We show that data values applied for selection can be reformulated as a sequential-decision-making problem.
We propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models.
arXiv Detail & Related papers (2025-02-06T23:03:10Z) - LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z) - CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning [0.0]
We propose the CHG (compound of Hardness and Gradient) utility function, which approximates the utility of each data subset on model performance in every training epoch.
By deriving the closed-form Shapley value for each data point using the CHG utility function, we reduce the computational complexity to that of a single model retraining.
We further leverage CHG Shapley for real-time data selection, conducting experiments across three settings: standard datasets, label noise datasets, and class imbalance datasets.
arXiv Detail & Related papers (2024-06-17T16:48:31Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
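The gradient-similarity idea behind LESS can be sketched as follows, using random vectors in place of real model gradients. The array shapes, the random projection, and the top-k rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example gradients (n examples, d parameters) and a
# target-task gradient; LESS would obtain these from model training.
n, d, k = 200, 1000, 10
train_grads = rng.normal(size=(n, d))
target_grad = rng.normal(size=d)

# Low-rank step: a random projection to r << d dimensions, which
# approximately preserves inner products (Johnson-Lindenstrauss).
r = 64
proj = rng.normal(size=(d, r)) / np.sqrt(r)
low_train = train_grads @ proj
low_target = target_grad @ proj

# Score each example by cosine similarity of its projected gradient to
# the projected target gradient, then keep the top-k examples.
scores = (low_train @ low_target) / (
    np.linalg.norm(low_train, axis=1) * np.linalg.norm(low_target)
)
selected = np.argsort(scores)[-k:][::-1]
```

Selecting a small fraction this way (e.g. the 5% mentioned above) keeps only the examples whose gradients point in roughly the same direction as the target task's gradient.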
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
arXiv Detail & Related papers (2024-01-10T14:40:23Z) - Shapley Value on Probabilistic Classifiers [6.163093930860032]
In the context of machine learning (ML), data valuation methods aim to equitably measure the contribution of each data point to the utility of an ML model.
Traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points.
We propose Probabilistic Shapley (P-Shapley) value by constructing a probability-wise utility function.
arXiv Detail & Related papers (2023-06-12T15:09:13Z) - Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z) - Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning [13.66570363867102]
We propose Beta Shapley, which is a substantial generalization of Data Shapley.
Beta Shapley unifies several popular data valuation methods and includes Data Shapley as a special case.
We demonstrate that Beta Shapley outperforms state-of-the-art data valuation methods on several downstream ML tasks.
arXiv Detail & Related papers (2021-10-26T22:03:55Z) - A Distributional Framework for Data Valuation [26.065217938868617]
We develop an algorithm for estimating values from data that comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms.
We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.
arXiv Detail & Related papers (2020-02-27T18:51:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.