Data Overvaluation Attack and Truthful Data Valuation
- URL: http://arxiv.org/abs/2502.00494v2
- Date: Tue, 04 Feb 2025 08:36:53 GMT
- Title: Data Overvaluation Attack and Truthful Data Valuation
- Authors: Shuyuan Zheng, Sudong Cai, Chuan Xiao, Yang Cao, Jianbin Qin, Masatoshi Yoshikawa, Makoto Onizuka
- Abstract summary: This paper introduces the first data overvaluation attack, enabling strategic clients to have their data significantly overvalued. We propose a truthful data valuation metric, named Truth-Shapley. Our experiments demonstrate the vulnerability of existing data valuation metrics to the data overvaluation attack and validate the robustness and effectiveness of Truth-Shapley.
- Score: 19.974649007968946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In collaborative machine learning, data valuation, i.e., evaluating the contribution of each client's data to the machine learning model, has become a critical task for incentivizing and selecting positive data contributions. However, existing studies often assume that clients engage in data valuation truthfully, overlooking the practical motivation for clients to exaggerate their contributions. To expose this threat, this paper introduces the first data overvaluation attack, enabling strategic clients to have their data significantly overvalued. Furthermore, we propose a truthful data valuation metric, named Truth-Shapley. Truth-Shapley is the unique metric that guarantees several desirable axioms for data valuation while ensuring that clients' optimal strategy is to perform truthful data valuation. Our experiments demonstrate the vulnerability of existing data valuation metrics to the data overvaluation attack and validate the robustness and effectiveness of Truth-Shapley.
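The abstract does not spell out Truth-Shapley's construction, but it positions the metric as a truthful refinement of Shapley-value data valuation. For orientation, here is a minimal Monte Carlo sketch of classic Shapley valuation; the `utility` callable (e.g., validation accuracy of a model trained on a coalition's pooled data) is an assumed placeholder, not the paper's implementation.

```python
import random

def shapley_values(clients, utility, num_samples=200, seed=0):
    """Monte Carlo estimate of Shapley values for data valuation.

    clients: list of client identifiers.
    utility: callable mapping a frozenset of clients to a real-valued
             model utility (e.g., validation accuracy) -- assumed given.
    """
    rng = random.Random(seed)
    phi = {c: 0.0 for c in clients}
    for _ in range(num_samples):
        order = clients[:]
        rng.shuffle(order)
        coalition, prev_u = frozenset(), utility(frozenset())
        for c in order:
            coalition = coalition | {c}
            u = utility(coalition)
            phi[c] += u - prev_u  # marginal contribution of c in this order
            prev_u = u
    return {c: v / num_samples for c, v in phi.items()}

# toy usage: an additive game where each of 3 clients is worth 1/3
print(shapley_values(["A", "B", "C"], lambda S: len(S) / 3))
```

The attack surface here is the marginal-contribution accounting itself: if a strategic client can influence the measured utilities, its estimated value inflates accordingly.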
Related papers
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them via model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
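How DUPRE featurizes subsets is not stated in this summary; the sketch below assumes a simple binary membership encoding and scikit-learn's GP regressor, so it illustrates the predict-instead-of-retrain idea rather than the paper's implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

n_clients = 5

def encode(subset):
    # binary membership vector: a stand-in featurization of a data subset
    x = np.zeros(n_clients)
    x[list(subset)] = 1.0
    return x

# utilities obtained by actually retraining on a few subsets (assumed given)
evaluated = {(0,): 0.61, (1, 2): 0.70, (0, 1, 2): 0.78, (3, 4): 0.66}
X = np.array([encode(s) for s in evaluated])
y = np.array(list(evaluated.values()))

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X, y)

# predict the utility of an unevaluated subset, with uncertainty,
# at no retraining cost
mean, std = gp.predict(np.array([encode((0, 3))]), return_std=True)
print(f"predicted utility: {mean[0]:.3f} +/- {std[0]:.3f}")
```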
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
- Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
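Which distance PriArTa computes, and how it is privatized, is not given in this summary; as a stand-in, the sketch below compares one-dimensional feature distributions with SciPy's Wasserstein distance, omitting PriArTa's privacy and communication-efficiency machinery. All datasets are synthetic placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
buyer = rng.normal(loc=0.0, scale=1.0, size=1000)     # buyer's existing data
seller_a = rng.normal(loc=0.1, scale=1.0, size=1000)  # similar distribution
seller_b = rng.normal(loc=2.0, scale=1.5, size=1000)  # dissimilar distribution

# a smaller distance means the seller's data resembles what the buyer has;
# whether that is desirable depends on the buyer's goal (coverage vs. novelty)
print("seller A:", wasserstein_distance(buyer, seller_a))
print("seller B:", wasserstein_distance(buyer, seller_b))
```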
arXiv Detail & Related papers (2024-11-01T17:13:14Z)
- Proper Dataset Valuation by Pointwise Mutual Information [26.693741797887643]
We propose an information-theoretic framework for evaluating data curation methods.
We compare informativeness using the Shannon mutual information between the evaluated data and the test data.
Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies.
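As a toy illustration of mutual-information-based scoring (not the paper's estimator), one can measure the Shannon mutual information between test labels and the predictions of a model trained on the curated data; less informative curation should yield lower MI.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def mi_score(X_curated, y_curated):
    # MI between test labels and the predictions of a model trained on the
    # curated set: a crude proxy for how informative the curation is
    preds = LogisticRegression(max_iter=1000).fit(X_curated, y_curated).predict(X_te)
    return mutual_info_score(y_te, preds)

print("full training set:", mi_score(X_tr, y_tr))
print("10% of the set   :", mi_score(X_tr[:150], y_tr[:150]))
```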
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- Towards Fair, Robust and Efficient Client Contribution Evaluation in Federated Learning [16.543724155324938]
We introduce a novel method called Fair, Robust, and Efficient Client Assessment (FRECA) for quantifying client contributions in Federated Learning (FL).
FRECA employs a framework called FedTruth to estimate the global model's ground truth update, balancing contributions from all clients while filtering out impacts from malicious ones.
Our experimental results show that FRECA can accurately and efficiently quantify client contributions in a robust manner.
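FedTruth's estimator is not detailed in this summary; the sketch below substitutes a coordinate-wise median as a robust "ground truth" update and scores each client by the cosine similarity of its update to that estimate, which captures the flavor of balancing contributions while filtering malicious ones.

```python
import numpy as np

def client_contributions(updates):
    """updates: dict of client_id -> 1-D model-update vector."""
    # robust stand-in for a ground-truth update: the coordinate-wise
    # median down-weights outlier (potentially malicious) updates
    truth = np.median(np.stack(list(updates.values())), axis=0)
    scores = {}
    for cid, u in updates.items():
        cos = u @ truth / (np.linalg.norm(u) * np.linalg.norm(truth) + 1e-12)
        scores[cid] = max(cos, 0.0)  # updates opposing the estimate score 0
    return scores

rng = np.random.default_rng(1)
true_grad = rng.normal(size=100)
updates = {f"client{i}": true_grad + 0.1 * rng.normal(size=100) for i in range(4)}
updates["attacker"] = -true_grad  # pushes the model the opposite way
print(client_contributions(updates))
```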
arXiv Detail & Related papers (2024-02-06T21:07:12Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to comprehensively assess the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
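The synthetic-data generation pipeline cannot be reconstructed from this summary, but the reliability metrics it feeds are standard; for example, a minimal expected calibration error (ECE) computation over per-pixel confidences looks like this (toy data, not the paper's segmenters):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """conf: predicted confidences in [0, 1]; correct: 0/1 outcomes."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # gap between average confidence and empirical accuracy,
            # weighted by the fraction of predictions in the bin
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)                  # per-pixel confidences
correct = (rng.uniform(size=10_000) < conf).astype(float)  # calibrated outcomes
print("calibrated   :", expected_calibration_error(conf, correct))
print("overconfident:", expected_calibration_error(np.clip(conf + 0.15, 0, 1), correct))
```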
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective [64.73162159837956]
Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging.
We propose DataCOPE, a data-centric framework for evaluating a target policy given a dataset.
Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies.
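DataCOPE's internals are not given here; for context, the classic inverse propensity scoring (IPS) estimator that off-policy evaluation builds on can be sketched as follows, with synthetic toy logs:

```python
import numpy as np

def ips_estimate(actions, rewards, logging_probs, target_probs):
    """Inverse propensity scoring estimate of a target policy's value.

    logging_probs[i]: probability the logging policy assigned to actions[i];
    target_probs[i]:  probability the target policy assigns to the same action.
    """
    weights = target_probs / logging_probs  # importance weights
    return float(np.mean(weights * rewards))

# toy log: 2 actions, uniform logging policy, action 1 always pays off
rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, size=n)
rewards = (actions == 1).astype(float)
logging_probs = np.full(n, 0.5)
target_probs = (actions == 1).astype(float)  # target always picks action 1
print(ips_estimate(actions, rewards, logging_probs, target_probs))  # ~1.0
```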
arXiv Detail & Related papers (2023-11-23T17:13:37Z)
- Data Valuation and Detections in Federated Learning [4.899818550820576]
Federated Learning (FL) enables collaborative model training while preserving the privacy of raw data.
A challenge in this framework is the fair and efficient valuation of data, which is crucial for incentivizing clients to contribute high-quality data in the FL task.
This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant datasets without a pre-specified training algorithm in an FL task.
arXiv Detail & Related papers (2023-11-09T12:01:32Z)
- Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training-exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data that appears with its solution on the internet, and release the web-page context of internet-derived data.
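Strategy (1) is directly actionable. Below is a minimal sketch using the `cryptography` package with hybrid encryption (a symmetric Fernet key for the bulk data, wrapped with RSA-OAEP); the file names and key size are illustrative assumptions, not the paper's prescription.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# benchmark owner's key pair (the private key is never published)
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

test_data = open("test_set.jsonl", "rb").read()  # hypothetical file name

# hybrid scheme: Fernet encrypts the bulk data, RSA-OAEP wraps the key
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(test_data)
wrapped_key = public_key.encrypt(
    sym_key,
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

# the first 256 bytes (2048-bit RSA) are the wrapped key, the rest the data;
# crawlers ingest only ciphertext, so the plaintext cannot leak into corpora
open("test_set.enc", "wb").write(wrapped_key + ciphertext)
```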
arXiv Detail & Related papers (2023-05-17T12:23:38Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
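The framework's concrete checks are not enumerated in this summary; the sketch below shows two common audit probes in the same spirit, marginal fidelity and a nearest-neighbor memorization (privacy) check, on placeholder data.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))
synthetic = rng.normal(size=(1000, 8))  # stand-in for a generator's output

# fidelity: do the marginal means/stds of the synthetic data match the source?
print("max mean gap:", np.abs(real.mean(0) - synthetic.mean(0)).max())
print("max std gap :", np.abs(real.std(0) - synthetic.std(0)).max())

# privacy probe: synthetic rows sitting unusually close to a real row
# suggest the generator memorized training records
dists, _ = cKDTree(real).query(synthetic, k=1)
print("closest synthetic-to-real distance:", dists.min())
print("fraction within 0.5 of a real row :", (dists < 0.5).mean())
```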
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Assessment of creditworthiness models privacy-preserving training with synthetic data [4.014524824655106]
We evaluate the performance of models trained with synthetic data when applied to real-world data.
Creditworthiness assessment models trained with synthetic data show a reduction of 3% in AUC and 6% in KS compared with models trained with real data.
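The AUC/KS comparison behind those figures is mechanical to reproduce; a sketch with scikit-learn and SciPy, using placeholder scores rather than the paper's models:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def auc_and_ks(y_true, scores):
    auc = roc_auc_score(y_true, scores)
    # KS: maximum gap between the score distributions of the two classes
    ks = ks_2samp(scores[y_true == 1], scores[y_true == 0]).statistic
    return auc, ks

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5000)
score_real = y + rng.normal(scale=1.0, size=5000)  # model trained on real data
score_syn = y + rng.normal(scale=1.3, size=5000)   # noisier synthetic-data model

for name, s in [("real-trained", score_real), ("synthetic-trained", score_syn)]:
    auc, ks = auc_and_ks(y, s)
    print(f"{name}: AUC={auc:.3f} KS={ks:.3f}")
```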
arXiv Detail & Related papers (2022-12-31T19:13:14Z)
- Statistical Dataset Evaluation: Reliability, Difficulty, and Validity [18.36931975072938]
We propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation.
We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity.
arXiv Detail & Related papers (2022-12-19T06:55:42Z)
- Data Poisoning Attacks and Defenses to Crowdsourcing Systems [26.147716118854614]
We show that crowdsourcing is vulnerable to data poisoning attacks, in which malicious clients provide carefully crafted data to corrupt the aggregated data.
We propose two defenses to reduce the impact of malicious clients.
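The two proposed defenses are not described in this summary; a standard mitigation in the same spirit is to aggregate worker answers with the median rather than the mean, which a small fraction of poisoned answers cannot drag arbitrarily far:

```python
import numpy as np

rng = np.random.default_rng(0)
honest = rng.normal(loc=10.0, scale=0.5, size=20)       # honest answers near 10
poisoned = np.concatenate([honest, np.full(5, 100.0)])  # 5 crafted answers

print("mean  :", poisoned.mean())      # dragged far from the true value
print("median:", np.median(poisoned))  # stays near 10
```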
arXiv Detail & Related papers (2021-02-18T06:03:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.